In  this  dissertation,  a  general  way  of  constructing  optimal  generalized  estimating
equations  (GEE)  is  given.  Applications  of  this  general  method  to  various  statistical 
problems,  such  as  the  quasi-likelihood  method  in  generalized  linear  models,  Cox's 
partial  likelihood  method  in  survival  analysis,  Bayesian  inference,  conditional  and 
marginal  inferences,  are  also  studied.  Also,  some  simple  results  about  matrix  valued 
convex  functions  are  proved  and  are  applied  to  the  study  of  optimal  designs,  mixture 
distributions  and  asymptotic  minimaxity. 

First,  a  notion  of  generalized  inner  product  spaces  is  introduced  to  study  optimal 
estimating  functions.  A  characterization  of  orthogonal  projections  in  generalized 
inner  product  spaces  is  given.  It  is  shown  that  the  orthogonal  projection  of  the  score 
function  into  a  linear  subspace  of  estimating  functions  is  optimal  in  that  subspace,  and 
a characterization of optimal estimating functions is given. Optimal estimating
functions in the Bayesian framework are also studied.
In  the  case  of  no  nuisance  parameters,  the  results  are  applied  to  study  longitudinal
data,  stochastic  processes,  time  series  models,  generalized  linear  models  and  Bayesian 
inference.  As  special  cases  of  the  main  results  of  this  chapter,  we  derive  the  results 
of  Godambe  on  the  foundation  of  estimation  in  stochastic  processes,  the  result  of 
Godambe and Thompson on the extension of quasi-likelihood, and the linear (and
quadratic) generalized estimating equations for multivariate data due to Liang and
Zeger and to Liang, Zeger and Qaqish. We also derive optimal estimating
equations in the Bayesian framework.

In the case where there are nuisance parameters, the results are applied to study
survival analysis models, the generalized estimating equations proposed by Liang,
Zeger and their associates, and the optimality of marginal and conditional inferences.
The three main topics are (A) globally optimal generalized estimating equations;
(B) locally optimal generalized estimating equations; and (C) conditionally optimal
generalized estimating equations. A general result is derived in each case. As special
cases, we rederive some of the results already available in the literature and also
find some new results. In particular, as special cases of our result on globally optimal
generalized estimating equations, we find the results of Godambe and Thompson
and Godambe with nuisance parameters. The results of Bhapkar on conditional and
marginal inference are also obtained as special cases. As applications of our result on
locally optimal generalized estimating equations, we find Lindsay's result on the
optimality of conditional score functions, extend Godambe's result on optimal estimating
functions for stochastic processes to nuisance parameters, and extend a recent result
of Murphy and Li about projected partial likelihood. Finally, our general result
on conditionally optimal generalized estimating equations helps generalize the findings
of Godambe and Thompson to situations which admit the presence of nuisance
parameters.
Finally, some simple results for matrix valued convex functions are proved, and
are used to study optimum experimental designs, to derive the fundamental theorem of
mixture distributions, and to generalize the asymptotic minimaxity result of Huber.
CHAPTER  1
INTRODUCTION 

1.1    Preamble

The objective of this thesis is to provide geometric insight into many useful
concepts in statistics and to use the geometry to unify many existing results, as
well as to derive several new ones. One major focus is to find optimal estimating
functions as orthogonal projections of score functions into appropriate linear
subspaces. The second goal is to use some important theorems from convex analysis for
finding optimal experimental designs, for deriving the fundamental theorem of mixture
distributions, and for proving the asymptotic minimaxity of estimating functions
in a very general framework.

1.2    Literature  Review 

We begin by reviewing the literature on estimating functions. The topic has
grown into an active research area over the past decade. Its beginning is marked
by the celebrated articles of Godambe (1960) and Durbin (1960). While Durbin
(1960) used estimating functions to study Gauss-Markov type results in a time series
setting, Godambe's (1960) main objective was to prove the optimality of the score
function in a parametric framework when there were no nuisance parameters. As
is well known, the Gauss-Markov theory and maximum likelihood estimation form
two cornerstones of statistical estimation. In their review article, Godambe and Kale
(1991) pointed out that the theory of estimating functions combines the strengths
of these two methods while eliminating many of their weaknesses. To cite
an example, the Gauss-Markov theorem fails for nonlinear least squares, but estimators
obtained as solutions of optimal estimating equations are identical to the least squares
estimators under homoscedasticity.

The theory of estimating functions has made rapid strides since the 1970s.
Godambe and Thompson (1974) and Godambe (1976) studied optimal estimating functions
in the presence of nuisance parameters, and proved a variety of optimality results.
Bhapkar (1972, 1989, 1991) and Bhapkar and Srinivasan (1994), in a series of articles,
studied the notions of sufficiency, ancillarity and information in the context of
estimating functions, and found conditional as well as marginal optimal estimating
functions. Amari and Kumon (1988), and Kumon and Amari (1984) used estimating
functions to estimate structural parameters in the presence of a large number
of nuisance parameters, their approach being based on vector bundle theory from
differential geometry.

Nelder and Wedderburn (1971), in their pioneering paper on generalized linear
models, showed that a large family of models could be fitted iteratively using a single
algorithm (the Newton-Raphson method). Later, Wedderburn (1974) realized that
only the first two moments were utilized in fitting the models, and this led to the
development of the so-called quasi-likelihood functions for generalized linear
models. Firth (1987), and Godambe and Thompson (1989) pointed
out the connection between quasi-likelihood and optimal estimating functions. An
interesting review article is due to Desmond (1991).

Cox  (1972),  in  his  seminal  paper,  introduced  the  proportional  hazards  model. 
Later,  Cox  (1975)  introduced  the  notion  of  partial  likelihood.  The  latter  is  intended  to 
eliminate  nuisance  parameters  (baseline  hazards  for  the  proportional  hazards  model) 
by  using  a  conditioning  argument.  Because  of  the  nested  structure  of  the  conditioning
variables,  Cox's  approach  also  fits  into  the  estimating  function  framework. 

Liang and Zeger (1986) used estimating functions (under the name
generalized estimating equations) to study longitudinal data. Their
motivation was similar to that behind Wedderburn's quasi-likelihood function, but in
the multivariate setting, where the correlation between responses within each
subject must be taken into account.

The study of Bayesian estimating functions is of more recent origin and is still in
its infancy. Ferreira (1981, 1982) and Ghosh (1990) initiated the study of optimal
estimating functions in a Bayesian framework. While Ferreira's formulation involves the
joint distribution of the observations and the parameters, Ghosh used a pure Bayesian
approach based only on the posterior probability density function.

The theory of optimum experimental designs was initiated by Elfving (1952,
1959) and Kiefer (1959). For references up to the early eighties, we refer to the
two monographs of Silvey (1980) and Pazman (1986). During the last decade, there
have been major advances in optimum experimental design theory; here we list only a
few of the main publications. Chaloner and Larntz (1989) studied optimal Bayesian designs
for the logistic regression model, El-Krunz and Studden (1991) studied Bayesian optimal
designs for linear regression models, while DasGupta and Studden (1991) studied
robust Bayesian experimental designs for normal linear models. Dette and Studden
(1993) studied the geometry of E-optimal designs, while Dette (1993) studied the
geometry of D-optimal designs, and Haines (1995) studied the geometry of Bayesian
designs.

There is a vast literature on the theory of mixture distributions. Laird (1978) studied
nonparametric maximum likelihood estimation of a mixing distribution. Lindsay
(1981, 1983a, 1983b) studied the properties and geometry of the maximum likelihood
estimator of a mixing distribution. In a recent monograph, Lindsay (1995) presented
a comprehensive treatment of the theory, geometry and applications of mixture models.
The book discusses a variety of topics about mixture distributions,
including the well known result proved by Shaked (1980) on mixtures from the exponential
family, and the fundamental theorem on mixture distributions proved by Lindsay
(1983a).

Huber (1964), in his pioneering paper, proved the well known asymptotic minimaxity
result for estimating functions of a location parameter. In his classical book on
robust statistics, Huber (1980) presented a more systematic treatment of asymptotic
minimaxity.

1.3    The  Subject  of  the  Dissertation 

This dissertation begins by unfolding the geometry of estimating functions and
pointing out many applications. Although the geometry is primarily used to study
estimating functions, it can also be used to study other statistical topics, such as the
Rao-Blackwell theorem, the Lehmann-Scheffe approach to uniform minimum variance
unbiased estimators, and prediction theory.

In Chapter 2, a notion of generalized inner product spaces is introduced to study
optimal estimating functions. A characterization of orthogonal projections in generalized
inner product spaces is given. It is shown that the orthogonal projection of
the score function into a linear subspace of estimating functions is optimal in that
subspace, and a characterization of optimal estimating functions is given. As special
cases of the main results of this chapter, we derive the results of Godambe (1985)
on the foundation of estimation in stochastic processes, the result of Godambe and
Thompson (1989) on the extension of quasi-likelihood, and the generalized estimating
equations for multivariate data due to Liang and Zeger (1986). We also derive
optimal estimating functions in the Bayesian framework, generalizing the results
obtained by Ferreira (1981, 1982) and Ghosh (1990).


In Chapter 3, the geometry of estimating functions in the presence of nuisance
parameters is studied. The three main topics are: (A) globally optimal estimating
functions; (B) locally optimal estimating functions; (C) conditionally optimal
estimating functions. A general result is derived in each case. As special cases, we
rederive some of the results already available in the literature, and also find some new
results. In particular, as special cases of our result on globally optimal estimating
functions, we find the results of Godambe and Thompson (1974) and Godambe (1976)
with nuisance parameters. The results of Bhapkar (1989, 1991a) on conditional and
marginal inference are also obtained as special cases. As applications of our result on
locally optimal estimating functions, we find Lindsay's (1982) result on the optimality
of conditional score functions, extend Godambe's (1985) result on optimal estimating
functions for stochastic processes, and extend a recent result of Murphy and Li
(1995) about projected partial likelihood. Finally, our general result on conditionally
optimal estimating functions helps generalize the findings of Godambe and Thompson
(1989) to situations which admit the presence of nuisance parameters.

In Chapter 4, we first prove some general results about convexity, and then apply
the results to various statistical problems, which include the theory of optimum
experimental designs (Silvey, 1980), the fundamental theorem of mixture distributions
due to Lindsay (1983a), and the asymptotic minimaxity of robust estimation due
to Huber (1964). In his classical paper on M-estimation, Huber (1964) proved an
asymptotic minimaxity result for estimating functions of a location parameter.
In this chapter, this fundamental result is extended to general estimating functions.
The geometric optimality of estimating functions proved in Chapter 2 will be used to
prove a necessary and sufficient condition for the asymptotic minimaxity of estimating
functions when the parameter space is multi-dimensional.

In  Chapter  5,  we  summarize  the  results  of  this  dissertation,  and  propose  some 
topics  of  future  research. 


CHAPTER  2 
THE  GEOMETRY  OF  ESTIMATING  FUNCTIONS  I 

2.1  Introduction 

The theory of estimating functions has advanced quite rapidly over the past two
decades. Godambe (1960) introduced the subject to prove finite sample optimality
of the score function in a parametric framework when no nuisance parameters were
present. Later, his idea was extended in many different directions, and optimal
estimating functions were derived under many different formulations.

The underlying thread in all these results is a geometric phenomenon which seems
to have gone unnoticed, or at least has never been brought out explicitly. In the
present chapter, we make this geometry explicit, and use it to derive optimal
estimating functions in certain contexts. In particular, it is shown that optimal
estimating functions for certain semiparametric models are indeed the orthogonal
projections of score functions into certain linear subspaces. Moreover, this geometry, by
its very nature, is neutral, and can be adapted within both the frequentist and the
Bayesian paradigms. Finally, the multiparameter situation can be handled automatically
through this geometry without any additional work.

The outline of the remaining sections is as follows. In Section 2.2, we develop
the mathematical prerequisites for the results of the subsequent sections. In particular,
we define generalized inner product spaces, and show the existence of orthogonal
projections of elements in these spaces into some linear subspaces. A characterization
theorem for these orthogonal projections is given, which is used repeatedly in
subsequent sections.

Section 2.3 generalizes the results of Godambe (1985) and Godambe and Thompson
(1989) to multiparameter situations and also finds optimal generalized estimating
equations (GEEs) for multivariate data. The GEEs used in Liang and Zeger (1986)
and Liang, Zeger and Qaqish (1992) turn out to be special cases of those proposed
in this section. The common thread in the derivation of all the optimal estimating
functions is the idea of orthogonal projection developed in Section 2.2.

Section 2.4 uses the orthogonal projection idea in deriving optimal Bayes estimating
functions. The results of Ferreira (1981, 1982) and Ghosh (1990) are included
as special cases. Section 2.5 uses an orthogonal decomposition to study the information
inequality. Section 2.6 studies the Hoeffding type decomposition for estimating
functions.

2.2    Generalized  Inner  Product  Spaces  and  Orthogonal  Projection 

In  this  section,  we  first  introduce  a  matrix  version  of  inner  product  spaces  which 
generalizes  the  notion  of  the  usual  scalar  valued  inner  product  space.  Next  we  provide 
the  definition  of  the  orthogonal  projection  of  an  element  of  a  generalized  inner  product 
space,  say,  L  into  a  linear  subspace  L0  of  L.  A  characterization  of  the  orthogonal 
projection  in  the  generalized  inner  product  space  is  also  given.  As  will  be  seen,  such 
a  characterization  generalizes  a  corresponding  result  for  scalar  inner  product  spaces. 
We  also  show  that  for  a  finite  dimensional  subspace  of  a  generalized  inner  product 
space,  an  orthogonal  projection  always  exists. 

We  begin  with  the  definition  of  a  matrix  valued  inner  product  space. 
Definition 2.2.1. Let $L$ be a real linear space, and let $M_{k \times k}$ be the set of all
$k \times k$ real matrices. The map

$$\langle \cdot, \cdot \rangle : L \times L \to M_{k \times k}$$

is called a generalized inner product if

(1) $\forall\, x, y \in L$, $\langle x, y \rangle = \langle y, x \rangle^{t}$;

(2) for any $k \times k$ matrix $M$ and $x, y \in L$, $\langle M x, y \rangle = M \langle x, y \rangle$;

(3) $\forall\, x, y, z \in L$, $\langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle$;

(4) $\forall\, x \in L$, $\langle x, x \rangle$ is non negative definite (n. n. d.), and $\langle x, x \rangle = 0$ iff
$x = 0$.

Two elements $x, y \in L$ are said to be orthogonal if $\langle x, y \rangle = 0$. Two sets $S_1$, $S_2$
are orthogonal if every element of $S_1$ is orthogonal to every element of $S_2$.

An  example  of  a  generalized  inner  product  space,  of  great  interest  to  statisticians 
is  the  one  where  the  generalized  inner  product  is  defined  by  the  covariance  matrix 
of random vectors. Specifically, let $\mathcal{X}$ be a sample space, and let $\Theta \subset R^{k}$ be the
parameter space, which is open. Consider the space $L$ of all functions

$$h : \mathcal{X} \times \Theta \to R^{k},$$

such that every element of the matrix $E[h(X, \theta)\, h(X, \theta)^{t} \mid \theta]$ is finite. For any
$h, g \in L$, $\theta \in \Theta$, the family of generalized inner products is defined by

$$\langle h, g \rangle_{\theta} = E[h(X, \theta)\, g(X, \theta)^{t} \mid \theta].$$

Then it is easy to verify that for fixed $\theta \in \Theta$, $\langle \cdot, \cdot \rangle_{\theta}$ is a generalized inner product
on $L$.
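
The axioms of Definition 2.2.1 can be checked numerically for this covariance-type inner product. The following Python sketch is only an illustration under stated assumptions: it uses a hypothetical sample space with six points, $k = 2$, and a fixed parameter value, and the probabilities `P`, the elements `x`, `y`, `z`, the matrix `M` and all function names are made up for the example. An element of $L$ is stored as a 2 x 6 array whose columns are its values at the sample points, and exact rational arithmetic is used so that the identities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical probabilities of a six-point sample space (they sum to 1).
P = [F(1, 12), F(1, 6), F(1, 4), F(1, 6), F(1, 6), F(1, 6)]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inner(X, Y):
    """Generalized inner product <X, Y> = E[X Y^t], a k x k matrix."""
    weighted = [[v * p for v, p in zip(row, P)] for row in X]
    return matmul(weighted, transpose(Y))

def is_nnd_2x2(A):
    """A symmetric 2 x 2 matrix is n.n.d. iff its diagonal and determinant are >= 0."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return A[0][0] >= 0 and A[1][1] >= 0 and det >= 0

# Illustrative elements of L (k = 2) and an arbitrary 2 x 2 matrix M.
x = [[1, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0]]
y = [[0, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]]
z = [[0, 1, 0, 0, 1, 1], [1, 0, 1, 0, 0, 1]]
M = [[2, 1], [0, 3]]

axiom1 = inner(x, z) == transpose(inner(z, x))                 # <x,z> = <z,x>^t
axiom2 = inner(matmul(M, x), z) == matmul(M, inner(x, z))      # <Mx,z> = M<x,z>
axiom3 = inner(x, add(y, z)) == add(inner(x, y), inner(x, z))  # additivity
axiom4 = is_nnd_2x2(inner(z, z))                               # <z,z> is n.n.d.
print(axiom1, axiom2, axiom3, axiom4)
```

Note that $\langle x, z \rangle$ is not symmetric in this example, so the transpose in axiom (1) cannot be dropped.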

Definition 2.2.2. Let $L$ be a generalized inner product space with inner product
$\langle \cdot, \cdot \rangle$. Suppose $L_0$ is a linear subspace of $L$. Let $s \in L$. An element $y_0 \in L_0$ is
called the orthogonal projection of $s$ into $L_0$ if

$$\langle s - y_0, s - y_0 \rangle = \min_{y \in L_0} \langle s - y, s - y \rangle, \tag{2.2.1}$$

where min is taken with respect to the usual ordering of matrices. More specifically,
for two square matrices $A$ and $B$ of the same order, we say that $A \geq B$ if $A - B$ is
n. n. d.

The  following  theorem  characterizes  the  orthogonal  projection  in  generalized  inner 
product  spaces. 

Theorem 2.2.1. Let $L$ be a generalized inner product space with inner product
$\langle \cdot, \cdot \rangle$, and $L_0$ be a linear subspace of $L$. Let $s \in L$. Then $y_0 \in L_0$ is the orthogonal
projection of $s$ into $L_0$ if and only if

$$\langle s - y_0, y \rangle = 0, \tag{2.2.2}$$

for all $y \in L_0$, i. e., $s - y_0$ and $L_0$ are orthogonal. Furthermore, if the orthogonal
projection exists, then it is unique.

Proof. Only if. For all $y \in L_0$, $\alpha \in R$, since $y_0 - \alpha y \in L_0$,

$$\langle s - y_0 + \alpha y, s - y_0 + \alpha y \rangle - \langle s - y_0, s - y_0 \rangle$$

is n. n. d., i. e.,

$$\alpha [\langle s - y_0, y \rangle + \langle s - y_0, y \rangle^{t}] + \alpha^{2} \langle y, y \rangle \tag{2.2.3}$$

is n. n. d., for all $y \in L_0$, $\alpha \in R$.

Now suppose that there exists $y_0^{*} \in L_0$ such that $\langle s - y_0, y_0^{*} \rangle \neq 0$. Let

$$A = \langle s - y_0, y_0^{*} \rangle + \langle s - y_0, y_0^{*} \rangle^{t}.$$

Then $A$ is real symmetric, and $A \neq 0$. Suppose $\lambda_1, \ldots, \lambda_k$ are the eigenvalues of $A$
with $|\lambda_1| \geq \ldots \geq |\lambda_k|$. Denote by $z_1$ the unit eigenvector corresponding to $\lambda_1$. Then
from (2.2.3), using $z_1^{t} A z_1 = \lambda_1$,

$$\alpha \lambda_1 + \alpha^{2} z_1^{t} \langle y_0^{*}, y_0^{*} \rangle z_1 \geq 0,$$

for all $\alpha \in R$. This implies that $\lambda_1 = 0$. So $A = 0$, a contradiction. Hence,

$$\langle s - y_0, y \rangle = 0,$$

for all $y \in L_0$.
If. Suppose

$$\langle s - y_0, y \rangle = 0,$$

for all $y \in L_0$. Then

$$\langle s - y, s - y \rangle - \langle s - y_0, s - y_0 \rangle$$

$$= \langle s - y_0 + y_0 - y, s - y_0 + y_0 - y \rangle - \langle s - y_0, s - y_0 \rangle$$

$$= \langle y_0 - y, y_0 - y \rangle, \tag{2.2.4}$$

which is n. n. d. The last equality follows since $\langle s - y_0, y_0 - y \rangle = 0$. This implies

$$\langle s - y_0, s - y_0 \rangle = \min_{y \in L_0} \langle s - y, s - y \rangle.$$

Finally we show that if an orthogonal projection exists, then it is unique. Suppose
that $y_1, y_2 \in L_0$ are both orthogonal projections of $s$ into $L_0$. Then

$$\langle s - y_i, y \rangle = 0,$$

for all $y \in L_0$, $i = 1, 2$. In particular,

$$\langle y_1 - y_2, y_1 - y_2 \rangle = \langle s - y_2, y_1 - y_2 \rangle - \langle s - y_1, y_1 - y_2 \rangle = 0 - 0 = 0.$$

So $y_1 = y_2$. This completes the proof of the theorem.

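Next we apply Theorem 2.2.1 to generalize a result of Lehmann-Scheffe to the
multidimensional case. Let $\mathcal{X}$ be a sample space, $\Theta \subset R^{k}$ an open set, and
$\gamma : \Theta \to R^{d}$ an estimable function, i. e., there exists $g : \mathcal{X} \to R^{d}$ such that

$$E[g(X) \mid \theta] = \gamma(\theta).$$

Let

$$U_{\gamma} = \{ g : \mathcal{X} \to R^{d} \mid E[g(X) \mid \theta] = \gamma(\theta), \ \forall\, \theta \in \Theta \},$$

$$U_0 = \{ h : \mathcal{X} \to R^{d} \mid E[h(X) \mid \theta] = 0, \ \forall\, \theta \in \Theta \},$$

where $g \in U_{\gamma}$, $h \in U_0$ satisfy that $E[g(X)\, g(X)^{t} \mid \theta]$ and $E[h(X)\, h(X)^{t} \mid \theta]$ are all well
defined. Note that $g^{*} \in U_{\gamma}$ is a locally minimum variance unbiased estimator of $\gamma(\theta)$
at $\theta = \theta_0$ if

$$E[g^{*}(X)\, g^{*}(X)^{t} \mid \theta_0] = \min_{g \in U_{\gamma}} E[g(X)\, g(X)^{t} \mid \theta_0].$$

Also it is easy to see that

$$U_{\gamma} = g + U_0, \quad \forall\, g \in U_{\gamma}.$$

Thus as an easy consequence of Theorem 2.2.1, we have the following generalization
of the Lehmann-Scheffe theorem.

Corollary 2.2.1. With the same notation as above, $g^{*} \in U_{\gamma}$ is a locally minimum
variance unbiased estimator of $\gamma(\theta)$ at $\theta = \theta_0$ iff

$$\langle g^{*}, h \rangle_{\theta_0} = E[g^{*}(X)\, h(X)^{t} \mid \theta_0] = 0, \quad \forall\, h \in U_0.$$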

Next we show that for any finite dimensional subspace in a generalized inner
product space, the orthogonal projection always exists. In order to do this, the famous
Gram-Schmidt orthogonalization procedure is used in generalized inner product
spaces. We need another definition.

Definition 2.2.3. Let $(L, \langle \cdot, \cdot \rangle)$ be a generalized inner product space. A set
of functions $\{h_i\}_{i=1}^{n}$ is said to be linearly independent, if for any set of $k \times k$ matrices
$\{A_{ij}\}$, defining

$$e_1 = h_1, \quad e_i = h_i - \sum_{j=1}^{i-1} A_{ij} h_j, \quad i = 2, \ldots, n,$$

$\{\langle e_i, e_i \rangle : i \in \{1, \ldots, n\}\}$ are all invertible.

The  following  is  the  Gram-Schmidt  orthogonalization  procedure  in  generalized 
inner  product  spaces. 
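Proposition 2.2.1. If $\{h_i\}_{i=1}^{n}$ is linearly independent, let

$$e_1 = h_1, \quad e_2 = h_2 - \langle h_2, e_1 \rangle \langle e_1, e_1 \rangle^{-1} e_1,$$

$$e_k = h_k - \sum_{i=1}^{k-1} \langle h_k, e_i \rangle \langle e_i, e_i \rangle^{-1} e_i, \quad k \in \{2, \ldots, n\}.$$

Then $\{e_i\}_{i=1}^{n}$ are orthogonal.
Proof. First note that

$$\langle e_2, e_1 \rangle = \langle h_2, e_1 \rangle - \langle h_2, e_1 \rangle = 0.$$

Now suppose that $e_1, \ldots, e_m$ are mutually orthogonal for some $m \geq 2$, that is,

$$\langle e_i, e_j \rangle = 0, \quad \forall\, 1 \leq j < i \leq m.$$

Then for all $j \in \{1, \ldots, m\}$

$$\langle e_{m+1}, e_j \rangle = \langle h_{m+1}, e_j \rangle - \langle h_{m+1}, e_j \rangle \langle e_j, e_j \rangle^{-1} \langle e_j, e_j \rangle = \langle h_{m+1}, e_j \rangle - \langle h_{m+1}, e_j \rangle = 0,$$

so that, by induction, $\{e_j\}_{j=1}^{n}$ are orthogonal.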
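
The Gram-Schmidt recursion of Proposition 2.2.1 can be sketched in Python as follows. This is a minimal illustration under assumptions, not an implementation taken from the dissertation: the six-point sample space with probabilities `P`, the elements `h1`, `h2`, `h3` (with $k = 2$), and all function names are hypothetical, and the inner product is the covariance-type one, $\langle X, Y \rangle = E[X Y^{t}]$. The rows of `h1`, `h2`, `h3` are chosen to be linearly independent, so every $\langle e_i, e_i \rangle$ is invertible.

```python
from fractions import Fraction as F

# Hypothetical probabilities of a six-point sample space (they sum to 1).
P = [F(1, 12), F(1, 6), F(1, 4), F(1, 6), F(1, 6), F(1, 6)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inner(X, Y):
    """Generalized inner product <X, Y> = E[X Y^t], a 2 x 2 matrix."""
    weighted = [[v * p for v, p in zip(row, P)] for row in X]
    return matmul(weighted, transpose(Y))

def inv2(A):
    """Inverse of an invertible 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = F(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def gram_schmidt(hs):
    """e_1 = h_1, e_k = h_k - sum_i <h_k, e_i> <e_i, e_i>^{-1} e_i."""
    es = []
    for h in hs:
        e = h
        for prev in es:
            coef = matmul(inner(h, prev), inv2(inner(prev, prev)))
            e = sub(e, matmul(coef, prev))
        es.append(e)
    return es

# Illustrative elements; each is a 2 x 6 array of values at the sample points.
h1 = [[1, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0]]
h2 = [[0, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]]
h3 = [[0, 1, 0, 0, 1, 1], [1, 0, 1, 0, 0, 1]]

e1, e2, e3 = gram_schmidt([h1, h2, h3])
ZERO = [[0, 0], [0, 0]]
print(inner(e2, e1) == ZERO, inner(e3, e1) == ZERO, inner(e3, e2) == ZERO)
```

The coefficients multiply from the left as $k \times k$ matrices, which is the only change from the scalar Gram-Schmidt procedure.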

The  above  result  is  used  to  prove  the  existence  of  the  orthogonal  projection  of 
every  element  of  a  generalized  inner  product  space  into  a  finite  dimensional  subspace. 

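Theorem 2.2.2. Let $(L, \langle \cdot, \cdot \rangle)$ be a generalized inner product space, and let
$L_0$ be a finite dimensional subspace of $L$ with linearly independent basis. Then for
any $s \in L$, the orthogonal projection of $s$ into $L_0$ always exists.

Proof. From Proposition 2.2.1, without loss of generality, we can assume that
$\{h_1, \ldots, h_m\}$ is an orthogonal basis for $L_0$. Let

$$A_i = \langle s, h_i \rangle \langle h_i, h_i \rangle^{-1}, \quad i \in \{1, \ldots, m\}.$$

We claim that the orthogonal projection of $s$ into $L_0$ is

$$h^{*} = \sum_{i=1}^{m} A_i h_i.$$

To see this, for any $h = \sum_{j=1}^{m} B_j h_j \in L_0$,

$$\langle s - h^{*}, h \rangle = \langle s, h \rangle - \langle h^{*}, h \rangle$$

$$= \sum_{j=1}^{m} \langle s, h_j \rangle B_j^{t} - \sum_{i=1}^{m} \sum_{j=1}^{m} A_i \langle h_i, h_j \rangle B_j^{t}$$

$$= \sum_{j=1}^{m} \langle s, h_j \rangle B_j^{t} - \sum_{j=1}^{m} A_j \langle h_j, h_j \rangle B_j^{t}$$

$$= \sum_{j=1}^{m} [\langle s, h_j \rangle - A_j \langle h_j, h_j \rangle] B_j^{t}$$

$$= 0.$$

Now apply Theorem 2.2.1.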

Theorem  2.2.2  will  be  used  repeatedly  in  the  subsequent  sections  for  the  derivation 
of  optimal  estimating  functions. 

Next  we  establish  an  abstract  information  inequality,  which  is  fundamental  to 
our  later  study.  The  motivation  for  the  following  definition  will  be  clear  from  the 
subsequent  section  where  we  define  information  related  to  an  estimating  function. 

Definition 2.2.4. Let $(L, \langle \cdot, \cdot \rangle)$ be a generalized inner product space, and let $s \in L$
be a fixed element. For any $g \in L$, the information of $g$ with respect to $s$ is defined
as

$$I_g = \langle g, s \rangle^{t} \langle g, g \rangle^{-} \langle g, s \rangle, \tag{2.2.5}$$

where "$-$" denotes a generalized inverse.

We  shall  also  need  the  following  theorem  later. 

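Theorem 2.2.3. Let $(L, \langle \cdot, \cdot \rangle)$ be a generalized inner product space, and let
$L_0$ be a linear subspace of $L$. For any $s \in L$, suppose $g^{*}$ is the orthogonal projection
of $s$ into $L_0$. Consider the function

$$I_g = \langle g, s \rangle^{t} \langle g, g \rangle^{-} \langle g, s \rangle.$$

Then

$$I_{g^{*}} - I_g$$

is n. n. d., for all $g \in L_0$.

Proof. Let $s = g^{*} + h$. Then using $\langle g, h \rangle = 0$,

$$I_g = \langle g, s \rangle^{t} \langle g, g \rangle^{-} \langle g, s \rangle = \langle g, g^{*} \rangle^{t} \langle g, g \rangle^{-} \langle g, g^{*} \rangle.$$

Also, using $\langle g^{*}, h \rangle = 0$,

$$I_{g^{*}} = \langle g^{*}, s \rangle^{t} \langle g^{*}, g^{*} \rangle^{-} \langle g^{*}, s \rangle = \langle g^{*}, g^{*} \rangle.$$

Now consider the matrix

$$\begin{pmatrix} \langle g^{*}, g^{*} \rangle & \langle g, g^{*} \rangle^{t} \\ \langle g, g^{*} \rangle & \langle g, g \rangle \end{pmatrix}.$$

For any $k$-dimensional vectors $a$ and $b$, we have that

$$\begin{pmatrix} a^{t} & b^{t} \end{pmatrix} \begin{pmatrix} \langle g^{*}, g^{*} \rangle & \langle g, g^{*} \rangle^{t} \\ \langle g, g^{*} \rangle & \langle g, g \rangle \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = a^{t} \langle g^{*}, g^{*} \rangle a + 2 a^{t} \langle g, g^{*} \rangle^{t} b + b^{t} \langle g, g \rangle b$$

$$= \langle a^{t} g^{*} + b^{t} g, a^{t} g^{*} + b^{t} g \rangle \geq 0.$$

Thus

$$\begin{pmatrix} \langle g^{*}, g^{*} \rangle & \langle g, g^{*} \rangle^{t} \\ \langle g, g^{*} \rangle & \langle g, g \rangle \end{pmatrix}$$

is n. n. d., which implies that

$$I_{g^{*}} - I_g = \langle g^{*}, g^{*} \rangle - \langle g, g^{*} \rangle^{t} \langle g, g \rangle^{-} \langle g, g^{*} \rangle$$

is n. n. d. The proof of the theorem is complete.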
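
Theorems 2.2.1 to 2.2.3 can be illustrated together by a small numerical sketch. The Python code below is an assumption-laden example rather than part of the original development: it reuses a hypothetical six-point sample space with probabilities `P` and $k = 2$, takes $L_0$ to be spanned by two made-up elements `h1` and `h2`, computes the projection `g_star` of a made-up `s` by the formula in the proof of Theorem 2.2.2 after Gram-Schmidt orthogonalization, and then checks the orthogonality condition (2.2.2) and the information inequality. All function names are illustrative, and ordinary inverses are used because the Gram matrices in this example are invertible.

```python
from fractions import Fraction as F

# Hypothetical probabilities of a six-point sample space (they sum to 1).
P = [F(1, 12), F(1, 6), F(1, 4), F(1, 6), F(1, 6), F(1, 6)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inner(X, Y):
    """Generalized inner product <X, Y> = E[X Y^t], a 2 x 2 matrix."""
    weighted = [[v * p for v, p in zip(row, P)] for row in X]
    return matmul(weighted, transpose(Y))

def inv2(A):
    """Inverse of an invertible 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = F(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def project(s, basis):
    """Orthogonal projection of s onto the span of `basis` (Theorem 2.2.2)."""
    es, proj = [], [[F(0)] * len(P) for _ in s]
    for h in basis:
        e = h
        for prev in es:  # Gram-Schmidt step of Proposition 2.2.1
            e = sub(e, matmul(matmul(inner(h, prev), inv2(inner(prev, prev))), prev))
        es.append(e)
        coef = matmul(inner(s, e), inv2(inner(e, e)))  # A_i = <s, e_i> <e_i, e_i>^{-1}
        proj = add(proj, matmul(coef, e))
    return proj

def info(g, s):
    """I_g = <g, s>^t <g, g>^{-1} <g, s>, assuming <g, g> is invertible."""
    gs = inner(g, s)
    return matmul(matmul(transpose(gs), inv2(inner(g, g))), gs)

def is_nnd_2x2(A):
    """A symmetric 2 x 2 matrix is n.n.d. iff its diagonal and determinant are >= 0."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return A[0][0] >= 0 and A[1][1] >= 0 and det >= 0

h1 = [[1, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0]]
h2 = [[0, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]]
s = [[0, 1, 0, 0, 1, 1], [1, 0, 1, 0, 0, 1]]

g_star = project(s, [h1, h2])
resid = sub(s, g_star)
ZERO = [[0, 0], [0, 0]]
orth_ok = inner(resid, h1) == ZERO and inner(resid, h2) == ZERO  # condition (2.2.2)
gap = sub(info(g_star, s), info(h1, s))                          # I_{g*} - I_g for g = h1
M = [[2, 1], [0, 3]]                                             # an invertible matrix
same_info = info(matmul(M, g_star), s) == info(g_star, s)
print(orth_ok, is_nnd_2x2(gap), info(g_star, s) == inner(g_star, g_star), same_info)
```

The last check shows that multiplying the projection by an invertible matrix leaves the information unchanged, which is the sense in which the optimal element can only be unique up to such a factor.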

The  following  result  will  be  used  to  establish  the  essential  uniqueness  of  optimal 
estimating  function. 
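Theorem 2.2.4. With the same notation as above, if $g^{*}, g \in L_0$, and $g^{*}$ is the
orthogonal projection of $s$ into $L_0$, and $\langle g^{*}, g^{*} \rangle$, $\langle g, g \rangle$ are invertible, then
$I_{g^{*}} = I_g$ if and only if there exists an invertible matrix $M$ such that

$$g^{*} = M g.$$

Proof. If. If $g^{*} = M g$, by straightforward calculation, we get

$$I_{g^{*}} = \langle g^{*}, s \rangle^{t} \langle g^{*}, g^{*} \rangle^{-1} \langle g^{*}, s \rangle$$

$$= \langle g, s \rangle^{t} M^{t} [M \langle g, g \rangle M^{t}]^{-1} M \langle g, s \rangle$$

$$= \langle g, s \rangle^{t} \langle g, g \rangle^{-1} \langle g, s \rangle.$$

Only if. If $I_{g^{*}} = I_g$, note that

$$I_{g^{*}} = \langle g^{*}, s \rangle^{t} \langle g^{*}, g^{*} \rangle^{-1} \langle g^{*}, s \rangle = \langle g^{*}, g^{*} \rangle,$$

and

$$I_g = \langle g, s \rangle^{t} \langle g, g \rangle^{-1} \langle g, s \rangle = \langle g, g^{*} \rangle^{t} \langle g, g \rangle^{-1} \langle g, g^{*} \rangle,$$

since $g^{*}$ is the orthogonal projection of $s$. Then

$$0 = I_{g^{*}} - I_g = \langle g^{*}, g^{*} \rangle - \langle g, g^{*} \rangle^{t} \langle g, g \rangle^{-1} \langle g, g^{*} \rangle.$$

Let $M = \langle g, g^{*} \rangle^{t} \langle g, g \rangle^{-1}$, then it is easy to verify that

$$\langle g^{*} - M g, g^{*} - M g \rangle = \langle g^{*}, g^{*} \rangle - M \langle g, g^{*} \rangle - \langle g^{*}, g \rangle M^{t} + M \langle g, g \rangle M^{t}$$

$$= \langle g^{*}, g^{*} \rangle - \langle g, g^{*} \rangle^{t} \langle g, g \rangle^{-1} \langle g, g^{*} \rangle = 0,$$

so that $g^{*} = M g$.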

As  an  easy  consequence  of  Theorems  2.2.2-2.2.4,  we  have  the  following  corollary. 

Corollary 2.2.2. Let $(L, \langle \cdot, \cdot \rangle)$ be a generalized inner product space, and let
$L_0$ be a finite dimensional subspace of $L$ with linearly independent basis. For
$s \in L$ and $g \in L_0$, let

$$I_g = \langle g, s \rangle^{t} \langle g, g \rangle^{-} \langle g, s \rangle.$$

Then there exists $g^{*} \in L_0$ such that

$$I_{g^{*}} - I_g \tag{2.2.6}$$

is n. n. d., for all $g \in L_0$. Furthermore, if $\langle g^{*}, g^{*} \rangle$ and $\langle g, g \rangle$ are invertible, then
$I_{g^{*}} = I_g$ if and only if there exists an invertible matrix $M$ such that

$$g^{*} = M g.$$
Proof.  Since  L0  is  a  finite  dimensional  subspace  of  L  with  linearly  independent 
basis,  then  by  Theorem  2.2.2,  for  any  s  G  L,  the  orthogonal  projection  g*  of  s  into  L0 
exists.  The  first  part  of  the  corollary  now  follows  from  Theorem  2.2.3.  The  second 
part  of  the  corollary  follows  from  Theorem  2.2.4. 

2.3    Optimal  Estimating  Functions:  A  Geometric  Approach 

In  this  section,  we  will  apply  the  results  obtained  in  the  previous  section  to  the 
theory  of  estimating  functions.  We  begin  with  the  definition  of  unbiased  estimating 
functions. 

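Let $\mathcal{X}$ be a sample space and $\Theta$ be a $k$ dimensional parameter space. A function

$$g : \mathcal{X} \times \Theta \to R^{k}$$

is an unbiased estimating function if

$$E[g(X, \theta) \mid \theta] = 0, \quad \forall\, \theta \in \Theta.$$

An unbiased estimating function $g$ is called regular if the following conditions hold:

(i) $d_{ij}(\theta) = E[\partial g_i / \partial \theta_j \mid \theta]$, $(1 \leq i, j \leq k)$ exists;

(ii) $E[g(X, \theta)\, g(X, \theta)^{t} \mid \theta]$ is positive definite.

Let $L$ denote the space of all regular unbiased estimating functions. For $g_1, g_2 \in L$,
we define the family of generalized inner products of $g_1$, $g_2$ as

$$\langle g_1, g_2 \rangle_{\theta} = E[g_1(X, \theta)\, g_2(X, \theta)^{t} \mid \theta], \quad \forall\, \theta \in \Theta. \tag{2.3.7}$$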
This  family  of  generalized  inner  products  will  be  used  throughout  this  section  without
specific  reference  to  it.  Also  we  shall  denote  by  s  the  score  function  of  a  parametric 
family  of  distributions.  We  assume  also  that  the  score  vector  is  regular  in  the  sense 
described  in  (i)  and  (ii). 

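Definition 2.3.1. With the same notation as above, let $(L, \langle\cdot,\cdot\rangle_\theta)$ be the family of generalized inner product spaces, and let $L_0$ be a subspace of $L$. For any $g \in L_0$, the information function of $g$ is defined as follows:

$$I_g(\theta) = E\Big[\frac{\partial g}{\partial\theta} \,\Big|\, \theta\Big]^t\, \langle g,g\rangle_\theta^{-1}\, E\Big[\frac{\partial g}{\partial\theta} \,\Big|\, \theta\Big]. \tag{2.3.8}$$

An element $g^* \in L_0$ is said to be an optimal estimating function in $L_0$ if

$$I_{g^*}(\theta) - I_g(\theta)$$

is n.n.d., for all $g \in L_0$ and $\theta \in \Theta$.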

Next  we  prove  a  key  result  which  shows  that  definition  (2.3.8)  is  indeed  equivalent 
to  definition  (2.2.5)  of  the  previous  section. 

In  the  rest  of  this  section,  unless  otherwise  stated,  we  shall  assume  the  following 
regularity  condition  for  unbiased  estimating  functions. 

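($\mathcal{R}$). For any $g \in L$,

$$E\Big[\frac{\partial g}{\partial\theta} \,\Big|\, \theta\Big] = -E[g\, s^t \mid \theta]. \tag{2.3.9}$$

Lemma 2.3.1. Under the regularity condition ($\mathcal{R}$), for any $g \in L$, the information matrix of $g$ can be written as

$$I_g(\theta) = \langle g,s\rangle_\theta^t\, \langle g,g\rangle_\theta^{-1}\, \langle g,s\rangle_\theta,$$

where $s$ is the score function.

Proof. The result follows easily, since for any $g \in L$ we can use (2.3.9) to get

$$-\langle g,s\rangle_\theta = E\Big[\frac{\partial g}{\partial\theta} \,\Big|\, \theta\Big].$$

Theorem 2.3.1. Let $L_0$ be a subspace of $L$. Assume that the orthogonal projection $g^*$ of $s$ into $L_0$ exists. Then

$$I_g(\theta) \le I_{g^*}(\theta), \quad \forall\, \theta \in \Theta, \ g \in L_0, \tag{2.3.10}$$

that is, $g^* \in L_0$ is an optimal estimating function in $L_0$. The optimal element in $L_0$ is unique in the following sense: if $g \in L_0$, then $I_g(\theta) = I_{g^*}(\theta)$, $\forall\, \theta \in \Theta$, if and only if there exists an invertible matrix valued function $M : \Theta \to M_{k\times k}$ such that for any $\theta \in \Theta$,

$$g^*(X,\theta) = M(\theta)\, g(X,\theta), \tag{2.3.11}$$

with probability 1 with respect to $P_\theta$.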

Proof.  The  first  part  follows  easily  from  Lemma  2.3.1  and  Theorem  2.2.3.  The 
second  part  follows  from  Theorem  2.2.4. 

Note that if $L_0$ is a finite dimensional subspace of $L$, then by Theorem 2.2.2 an orthogonal projection $g^*$ of $s \in L$ into $L_0$ always exists, so that the conclusions given in (2.3.10) and (2.3.11) always hold. Also, in this case, Proposition 2.2.1 and Theorem 2.2.2 show how to construct optimal estimating functions.
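To make the finite dimensional case concrete, the following is a minimal numerical sketch of Theorem 2.3.1 on a toy model. It is not taken from the text: the five-point sample space, the exponential-family pmf, the three basis functions in `U`, and all function names are assumptions chosen for illustration. The projection is computed as $g^* = \langle s,u\rangle_\theta \langle u,u\rangle_\theta^{-1} u$, and $I_{g^*} - I_g$ is checked to be non negative definite for randomly chosen elements $g = Wu$ of $L_0$.

```python
import numpy as np

# Toy model (an assumption for illustration): X takes values 0..4 with
# pmf p(x; theta) proportional to exp(theta_1 * x + theta_2 * x**3 / 10).
x = np.arange(5.0)
T = np.column_stack([x, x**3 / 10.0])          # sufficient statistics, 5 x 2
theta = np.array([0.3, -0.2])
logits = T @ theta
p = np.exp(logits - logits.max())
p /= p.sum()                                    # pmf at the chosen theta

def inner(f, g):
    """Generalized inner product <f, g> = E[f(X) g(X)^t | theta]."""
    return f.T @ (p[:, None] * g)

def center(f):
    """Subtract E[f | theta] so that every column is an unbiased function."""
    return f - p @ f

S = center(T)                                   # score vector s(x), 5 x 2
U = center(np.column_stack([x, x**2 / 4.0, x**4 / 64.0]))  # basis u(x) of L0, 5 x 3

# Orthogonal projection of s into L0 = {W u : W is a 2 x 3 matrix}.
W_star = inner(S, U) @ np.linalg.inv(inner(U, U))
G_star = U @ W_star.T                           # g*(x), 5 x 2

def information(G):
    """I_g = <g,s>^t <g,g>^{-1} <g,s>, as in Lemma 2.3.1."""
    gs = inner(G, S)
    return gs.T @ np.linalg.solve(inner(G, G), gs)

residual = inner(S - G_star, U)                 # <s - g*, u>, zero up to rounding
rng = np.random.default_rng(0)
min_eigs = []
for _ in range(200):
    W = rng.normal(size=(2, 3))                 # an arbitrary competitor g = W u
    diff = information(G_star) - information(U @ W.T)
    min_eigs.append(np.linalg.eigvalsh((diff + diff.T) / 2).min())
worst = min(min_eigs)                           # smallest eigenvalue over all competitors
```

In this sketch `residual` vanishes up to rounding error, which is the orthogonality condition of Theorem 2.2.1, and `worst` is non negative up to rounding error, which is the optimality statement (2.3.10).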

In the remainder of this section, we shall see several applications of Theorem 2.3.1 for deriving optimal estimating functions in different contexts. We begin by generalizing a result of Godambe (1985) to the case where the parameter space is multidimensional. We also bring out more explicitly the characterization of optimal estimating functions, in a more general framework than that of Theorem 1 of Godambe (1985). Let $\{X_1, X_2, \ldots, X_n\}$ be a discrete stochastic process, and let $\Theta \subset R^k$ be an open set. Let $h_i$ be an $R^k$-valued function of $X_1, \ldots, X_i$ and $\theta$ such that

$$E_{i-1}[h_i(X_1,\ldots,X_i;\theta) \mid \theta] = 0, \quad (i = 1,\ldots,n,\ \theta \in \Theta), \tag{2.3.12}$$

where $E_{i-1}$ denotes the conditional expectation given the first $i-1$ variables, namely $X_1, \ldots, X_{i-1}$. Let

$$L_0 = \Big\{ g : g = \sum_{i=1}^n A_{i-1} h_i \Big\},$$

where $A_{i-1}$ is an $M_{k\times k}$-valued function of $X_1, \ldots, X_{i-1}$ and $\theta$, for all $i \in \{1,\ldots,n\}$.

The  following  theorem  generalizes  the  result  of  Godambe  (1985). 

Theorem 2.3.2. With the same notation as above, suppose $h_i$ satisfies the regularity condition ($\mathcal{R}$). Let

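$$A^*_{i-1} = -E_{i-1}\Big[\frac{\partial h_i}{\partial\theta} \,\Big|\, \theta\Big]^t \big\{E_{i-1}[h_i h_i^t \mid \theta]\big\}^{-1}, \quad \forall\, i \in \{1,2,\ldots,n\},$$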

and 

$$g^* = \sum_{i=1}^n A^*_{i-1} h_i.$$

Then the following conclusions hold:

(a) $g^*$ is the orthogonal projection of $s$ into $L_0$.

(b) $g^*$ is an optimal estimating function in $L_0$, i.e.,

$$I_g(\theta) \le I_{g^*}(\theta),$$

for all $g \in L_0$ and $\theta \in \Theta$.

(c) If $g \in L_0$ and $E[g g^t \mid \theta]$ is invertible, then $I_g(\theta) = I_{g^*}(\theta)$ for all $\theta \in \Theta$ if and only if there exists an invertible matrix function $M : \Theta \to M_{k\times k}$ such that for any $\theta \in \Theta$,

$$g^*(X_1,\ldots,X_n;\theta) = M(\theta)\, g(X_1,\ldots,X_n;\theta),$$

with probability 1 with respect to $P_\theta$.

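Proof. (a). For any $g = \sum_{j=1}^n A_{j-1} h_j \in L_0$ and $\theta \in \Theta$,

$$\begin{aligned}
\langle s - g^*, g\rangle_\theta &= \langle s,g\rangle_\theta - \langle g^*,g\rangle_\theta \\
&= \sum_{i=1}^n E[s\, h_i^t A_{i-1}^t \mid \theta] - \sum_{i=1}^n \sum_{j=1}^n E[A^*_{i-1} h_i h_j^t A_{j-1}^t \mid \theta] \\
&= \sum_{i=1}^n E\{E_{i-1}[s\, h_i^t A_{i-1}^t \mid \theta] \mid \theta\} - \sum_{i=1}^n E[A^*_{i-1} h_i h_i^t A_{i-1}^t \mid \theta] \\
&\quad - \sum_{i<j} E[A^*_{i-1} h_i h_j^t A_{j-1}^t \mid \theta] - \sum_{i>j} E[A^*_{i-1} h_i h_j^t A_{j-1}^t \mid \theta].
\end{aligned} \tag{2.3.13}$$

But for $i < j$,

$$\begin{aligned}
E[A^*_{i-1} h_i h_j^t A_{j-1}^t \mid \theta] &= E\{E_{j-1}[A^*_{i-1} h_i h_j^t A_{j-1}^t \mid \theta] \mid \theta\} \\
&= E\{A^*_{i-1} h_i\, E_{j-1}[h_j^t A_{j-1}^t \mid \theta] \mid \theta\} = 0.
\end{aligned}$$

Similarly, for $i > j$,

$$E[A^*_{i-1} h_i h_j^t A_{j-1}^t \mid \theta] = 0.$$

Thus from equation (2.3.13), we get

$$\langle s - g^*, g\rangle_\theta = \sum_{i=1}^n E\big\{\big(E_{i-1}[s\, h_i^t \mid \theta] - A^*_{i-1} E_{i-1}[h_i h_i^t \mid \theta]\big) A_{i-1}^t \mid \theta\big\} = 0.$$

Hence $g^*$ is the orthogonal projection of $s$ into $L_0$.

Parts (b) and (c) of the theorem follow easily from part (a) and Theorem 2.3.1.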
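As a hedged illustration of Theorem 2.3.2 (an example added here, not part of the original text), consider a Gaussian AR(1) process $X_i = \theta X_{i-1} + \varepsilon_i$ with known innovation variance $\sigma^2$, and take $h_i = X_i - \theta X_{i-1}$. Then $E_{i-1}[\partial h_i/\partial\theta \mid \theta] = -X_{i-1}$ and $E_{i-1}[h_i^2 \mid \theta] = \sigma^2$, so the optimal weight is proportional to $X_{i-1}/\sigma^2$ and the equation $g^* = 0$ has a closed-form root. The simulation settings and function names below are assumptions made for this sketch.

```python
import numpy as np

def simulate_ar1(theta, sigma, n, rng):
    """Simulate X_0 = 0 and X_i = theta * X_{i-1} + N(0, sigma^2) noise."""
    x = np.zeros(n + 1)
    eps = rng.normal(scale=sigma, size=n)
    for i in range(1, n + 1):
        x[i] = theta * x[i - 1] + eps[i - 1]
    return x

def g_star(theta, x, sigma):
    """Weighted sum over i of (X_{i-1} / sigma^2) * (X_i - theta * X_{i-1})."""
    prev, curr = x[:-1], x[1:]
    return np.sum(prev * (curr - theta * prev)) / sigma**2

def solve_g_star(x):
    """Closed-form root of g*(theta) = 0; sigma cancels out of the root."""
    prev, curr = x[:-1], x[1:]
    return np.sum(prev * curr) / np.sum(prev**2)

rng = np.random.default_rng(0)
x = simulate_ar1(theta=0.6, sigma=1.0, n=5000, rng=rng)
theta_hat = solve_g_star(x)
```

Multiplying $g^*$ by any invertible factor, including a change of sign, leaves the root unchanged, which is consistent with the uniqueness statement in part (c).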

A  second  application  of  Theorem  2.3.1  is  to  give  a  geometric  formulation  of  a  result 
of  Godambe  and  Thompson  (1989),  who  proved  the  existence  of  optimal  estimating 
functions  using  mutually  orthogonal  estimating  functions.  What  we  show  is  that  the 
optimal  estimating  function  of  Godambe  and  Thompson  is  indeed  the  orthogonal 
projection  of  the  score  function  into  an  appropriate  linear  subspace. 

To this end, let $\mathcal{X}$ denote the sample space, let $\theta = (\theta_1,\ldots,\theta_m)$ be a vector of parameters, and let $h_j$, $j = 1,\ldots,k$, be real functions on $\mathcal{X} \times \Theta$ such that

$$E[h_j(X,\theta) \mid \theta, \mathcal{X}_j] = 0, \quad \forall\, \theta \in \Theta, \ j = 1,\ldots,k,$$

where $\mathcal{X}_j$ is a specified partition of $\mathcal{X}$, $j = 1,\ldots,k$. We will denote

$$E[\,\cdot \mid \theta, \mathcal{X}_j] = E_{(j)}[\,\cdot \mid \theta].$$

Consider the class of estimating functions

$$L_0 = \{g : g = (g_1,\ldots,g_m)\},$$

where

$$g_r = \sum_{j=1}^k q_{jr} h_j, \quad r = 1,\ldots,m,$$

$q_{jr} : \mathcal{X} \times \Theta \to R$ being measurable with respect to the partition $\mathcal{X}_j$ for $j = 1,\ldots,k$, $r = 1,\ldots,m$.
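Let

$$q^*_{jr} = -\frac{E_{(j)}[\partial h_j/\partial\theta_r \mid \theta]}{E_{(j)}[h_j^2 \mid \theta]}, \tag{2.3.14}$$

for all $j = 1,\ldots,k$, $r = 1,\ldots,m$, and

$$g^*_r = \sum_{j=1}^k q^*_{jr} h_j, \quad r = 1,\ldots,m.$$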


The estimating functions $h_j$, $j = 1,\ldots,k$, are said to be mutually orthogonal if

$$E[q^*_{jr} h_j\, q^*_{j'r'} h_{j'} \mid \theta] = 0, \quad \forall\, j \ne j', \ r, r' = 1,\ldots,m. \tag{2.3.15}$$

Theorem 2.3.3. With the same notation as above, if $\{h_j\}_{j=1}^k$ are mutually orthogonal, then the following hold:

(a) $g^*$ is the orthogonal projection of the score function $s$ into $L_0$.

(b) $g^*$ is an optimal estimating function in $L_0$.

(c) If $g \in L_0$, and $E[g g^t \mid \theta]$ is invertible, then $I_g(\theta) = I_{g^*}(\theta)$, $\forall\, \theta \in \Theta$, if and only if there exists an invertible matrix function $M : \Theta \to M_{k\times k}$ such that for any $\theta \in \Theta$,

$$g^*(X;\theta) = M(\theta)\, g(X;\theta),$$

with probability 1 with respect to $P_\theta$.

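Proof. (a). Let $s_r$ denote the $r$th component of $s$. We only need to show that, $\forall\, r \in \{1,\ldots,m\}$ and $g_r = \sum_{j=1}^k q_{jr} h_j$,

$$\langle s_r - g^*_r, g_r\rangle_\theta = 0, \quad \forall\, \theta \in \Theta,$$

i.e.,

$$\langle s_r, g_r\rangle_\theta = \langle g^*_r, g_r\rangle_\theta, \quad \forall\, \theta \in \Theta.$$

But

$$\begin{aligned}
\langle g^*_r, g_r\rangle_\theta &= \sum_{j=1}^k \sum_{j'=1}^k E[q^*_{jr} h_j\, q_{j'r} h_{j'} \mid \theta] \\
&= \sum_{j=1}^k E\{q^*_{jr}\, q_{jr}\, E_{(j)}[h_j^2 \mid \theta] \mid \theta\} \\
&= -\sum_{j=1}^k E\Big\{q_{jr}\, E_{(j)}\Big[\frac{\partial h_j}{\partial\theta_r} \,\Big|\, \theta\Big] \,\Big|\, \theta\Big\}.
\end{aligned}$$

Also

$$\begin{aligned}
\langle s_r, g_r\rangle_\theta &= \sum_{j=1}^k E[q_{jr}\, s_r h_j \mid \theta] \\
&= \sum_{j=1}^k E\{q_{jr}\, E_{(j)}[s_r h_j \mid \theta] \mid \theta\} \\
&= -\sum_{j=1}^k E\Big\{q_{jr}\, E_{(j)}\Big[\frac{\partial h_j}{\partial\theta_r} \,\Big|\, \theta\Big] \,\Big|\, \theta\Big\}.
\end{aligned}$$

Thus $g^*$ is the orthogonal projection of the score function into $L_0$.

Once again (b) and (c) follow from part (a) and Theorem 2.3.1. This completes the proof.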

Note that part (b) of Theorem 2.3.3 is due to Godambe and Thompson (1989), while the other two parts are new. We repeat that this theorem provides a geometric formulation of optimal estimating functions in a finite dimensional subspace of estimating functions. Also, through this approach, the characterization of optimal estimating functions is very easy to establish.

Finally, we apply the above result to obtain optimal generalized estimating equations for multivariate data. Let $\mathcal{X}_i$ denote the sample space for the $i$th subject, let $\Theta \subset R^d$ be a subset with nonempty interior, and let

$$u_i : \mathcal{X}_i \times \Theta \to R^{n_i}, \quad i = 1,\ldots,k,$$

be such that $E[u_i(X_i,\theta) \mid \theta] = 0$, $\forall\, \theta \in \Theta$. Suppose that conditional on $\theta$, $\{u_i(X_i,\theta)\}_{i=1}^k$ are independent. Consider the estimating space

$$L_0 = \Big\{ \sum_{i=1}^k W_i(\theta)\, u_i(X_i,\theta) \Big\},$$

where $W_i(\theta)$ is a $d \times n_i$ matrix, $i = 1,\ldots,k$. Let

$$W^*_i(\theta) = -E\Big[\frac{\partial u_i}{\partial\theta} \,\Big|\, \theta\Big]^t\, [\mathrm{Var}(u_i \mid \theta)]^{-1}, \quad i = 1,\ldots,k,$$

$$g^* = \sum_{i=1}^k W^*_i(\theta)\, u_i(X_i,\theta).$$

Then we have the following result.

Theorem 2.3.4. With the same notation as above,

(a) $g^*$ is the orthogonal projection of the score function into $L_0$.

(b) $g^*$ is an optimal estimating function in $L_0$.

(c) If $g \in L_0$, and $E[g g^t \mid \theta]$ is invertible, then $I_g(\theta) = I_{g^*}(\theta)$, $\forall\, \theta \in \Theta$, if and only if there exists an invertible matrix function $M : \Theta \to M_{k\times k}$ such that for any $\theta \in \Theta$,

$$g^*(X;\theta) = M(\theta)\, g(X;\theta),$$

with probability 1 with respect to $P_\theta$.

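Proof. (a). We only need to show that, $\forall\, g = \sum_{i=1}^k W_i(\theta)\, u_i(X_i,\theta)$,

$$\langle s,g\rangle_\theta = \langle g^*,g\rangle_\theta.$$

But

$$\begin{aligned}
\langle g^*,g\rangle_\theta &= \sum_{i=1}^k \sum_{j=1}^k W^*_i\, \langle u_i(X_i,\theta), u_j(X_j,\theta)\rangle_\theta\, W_j^t \\
&= \sum_{i=1}^k W^*_i\, \langle u_i(X_i,\theta), u_i(X_i,\theta)\rangle_\theta\, W_i^t \\
&= -\sum_{i=1}^k E\Big[\frac{\partial u_i}{\partial\theta} \,\Big|\, \theta\Big]^t W_i^t;
\end{aligned}$$

also, by (2.3.9) applied to each $u_i$,

$$\langle s,g\rangle_\theta = \sum_{i=1}^k \langle s, u_i(X_i,\theta)\rangle_\theta\, W_i^t = -\sum_{i=1}^k E\Big[\frac{\partial u_i}{\partial\theta} \,\Big|\, \theta\Big]^t W_i^t.$$

Thus $g^*$ is the orthogonal projection of the score function into $L_0$.

Parts (b) and (c) follow from part (a) and Theorem 2.3.1.

Note that by choosing the functions $u_i$ appropriately, we can very easily get the generalized estimating equations introduced by Liang and Zeger (1986). For further information about generalized estimating equations, we refer to Liang, Zeger and Qaqish (1992).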
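The following sketch instantiates Theorem 2.3.4 for a linear mean model. It is an illustration under stated assumptions rather than the method of Liang and Zeger: the residuals are $u_i = y_i - X_i\beta$, the within-subject covariance $V$ is taken as known and exchangeable, and all variable names are hypothetical. In that case $E[\partial u_i/\partial\beta \mid \beta] = -X_i$, the optimal weight is $W_i^* = X_i^t V^{-1}$, and solving $g^* = 0$ reduces to generalized least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n_i, d = 50, 4, 2                       # subjects, observations per subject, parameters
beta_true = np.array([1.0, -0.5])

# Exchangeable covariance within each subject (a hypothetical choice).
V = 0.5 * np.eye(n_i) + 0.5 * np.ones((n_i, n_i))
V_inv = np.linalg.inv(V)
chol = np.linalg.cholesky(V)

X = rng.normal(size=(k, n_i, d))           # design matrix X_i for each subject
y = X @ beta_true + rng.normal(size=(k, n_i)) @ chol.T   # errors with covariance V

def g_star(beta):
    """Optimal estimating function: sum over i of X_i^t V^{-1} (y_i - X_i beta)."""
    resid = y - X @ beta                   # shape (k, n_i)
    return np.einsum("kid,ij,kj->d", X, V_inv, resid)

# g*(beta) is linear in beta, so its root solves a d x d linear system.
A = np.einsum("kid,ij,kje->de", X, V_inv, X)
b = np.einsum("kid,ij,kj->d", X, V_inv, y)
beta_hat = np.linalg.solve(A, b)
```

For a nonlinear mean the same weights would be used inside an iterative root finder; the linear case is shown only because the root is available in closed form.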

2.4    Optimal Bayesian Estimating Functions

In this section, we study the geometry of estimating functions within a Bayesian framework. There are two basic approaches here. One formulation is based on the joint distribution of the data and prior, as introduced by Ferreira (1981, 1982). The second formulation, due to Ghosh (1990), is based on the posterior density. We shall study both and see how the notion of orthogonal projection can be brought within the Bayesian formulation as well.

We begin with Ferreira's (1981, 1982) formulation. Let $\mathcal{X}$ be the sample space, $\Theta \subset R^k$ be an open set, $p(x \mid \theta)$ be the conditional density of $X$ given $\theta$, and $\pi(\theta)$ be a prior density. Let $g : \mathcal{X} \times \Theta \to R^k$ be a function such that

(1) $\partial g/\partial\theta$ exists, $\forall\, \theta \in \Theta$;

(2) $E[g(X,\theta)\, g(X,\theta)^t]$ is invertible, where $E$ denotes expectation over the joint distribution of $X$ and $\theta$.

Let $L$ denote the set of all functions $g : \mathcal{X} \times \Theta \to R^k$ which satisfy (1) and (2) above. The generalized inner product on $L$ is defined by

$$\langle f,g\rangle = E[f(X,\theta)\, g(X,\theta)^t], \quad \forall\, f, g \in L. \tag{2.4.16}$$

It is straightforward to verify that (2.4.16) is a generalized inner product on $L$.

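The following calculation will be used to serve as a key connection between the formulation of Ferreira about optimal Bayesian estimating functions and our geometric formulation. It also provides a geometric insight into the result of Ferreira. Throughout this section, we shall always assume that $p(X \mid \theta)$ and $\pi(\theta)$ are differentiable with respect to $\theta$.

Lemma 2.4.1. Let $\pi(\theta \mid X)$ be the posterior density, let

$$s_j = \frac{\partial\log\pi(\theta \mid X)}{\partial\theta_j},$$

and let $g_i : \mathcal{X} \times \Theta \to R$ be a function. Then, $\forall\, i,j \in \{1,\ldots,k\}$,

$$E[g_i s_j] = -E\Big\{\frac{\partial g_i}{\partial\theta_j}\Big\} + E\Big\{\frac{\partial E[g_i \mid \theta]}{\partial\theta_j} + E[g_i \mid \theta]\, \frac{\partial\log\pi(\theta)}{\partial\theta_j}\Big\}. \tag{2.4.17}$$

Proof.

$$\begin{aligned}
E[g_i s_j] &= E\Big\{E\Big[g_i\, \frac{\partial\log\pi(\theta \mid X)}{\partial\theta_j} \,\Big|\, \theta\Big]\Big\} \\
&= E\Big\{E\Big[g_i\, \frac{\partial\log p(X \mid \theta)}{\partial\theta_j} \,\Big|\, \theta\Big]\Big\} + E\Big\{E\Big[g_i\, \frac{\partial\log\pi(\theta)}{\partial\theta_j} \,\Big|\, \theta\Big]\Big\}.
\end{aligned}$$

Differentiating $E[g_i \mid \theta] = \int g_i\, p(x \mid \theta)\, dx$ under the integral sign shows that the inner expectation in the first term equals $\partial E[g_i \mid \theta]/\partial\theta_j - E[\partial g_i/\partial\theta_j \mid \theta]$. This completes the proof.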

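Note that if $E[g_i \mid \theta] = 0$, then

$$E[g_i s_j] = -E\Big[\frac{\partial g_i}{\partial\theta_j}\Big]; \tag{2.4.18}$$

also if $g_i$ is only a function of $\theta$, then

$$E[g_i s_j] = E\{E[g_i s_j \mid \theta]\} = E\Big[g_i\, \frac{\partial\log\pi(\theta)}{\partial\theta_j}\Big]. \tag{2.4.19}$$

Suppose now

$$B_{ij}(g) = E\Big\{\frac{\partial E[g_i \mid \theta]}{\partial\theta_j} + E[g_i \mid \theta]\, \frac{\partial\log\pi(\theta)}{\partial\theta_j}\Big\}, \tag{2.4.20}$$

for $i = 1,\ldots,k$, $j = 1,\ldots,k$, where $g = (g_1,\ldots,g_k)$. Let $s = (s_1,\ldots,s_k)$. Using Lemma 2.4.1, we then have

$$\langle g,s\rangle = -\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big).$$

If $E[g \mid \theta] = 0$, from (2.4.17) and (2.4.18),

$$\langle g,s\rangle = -E\Big[\frac{\partial g}{\partial\theta}\Big]; \tag{2.4.21}$$

also if $g$ is only a function of $\theta$, then

$$\langle g,s\rangle = E\Big[g\, \Big(\frac{\partial\log\pi(\theta)}{\partial\theta}\Big)^t\Big]. \tag{2.4.22}$$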

Now  by  combining  the  previous  theorem  and  the  above  lemma,  we  have  the  following 
result,  which  is  a  generalization  of  the  main  result  due  to  Ferreira  (1981,  1982)  to  the 
multidimensional  case. 

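Theorem 2.4.1. For $g \in L$, let

$$M_g = E[g(X,\theta)\, g(X,\theta)^t]; \tag{2.4.23}$$

then

$$\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big)^t M_g^{-1} \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big) \le M_s$$

for all $g \in L$.

Proof. From the previous Lemma,

$$\langle g,s\rangle = -\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big).$$

Also $M_s = \langle s,s\rangle^t \langle s,s\rangle^{-1} \langle s,s\rangle$. Thus the result follows easily from Theorem 2.2.3.

Note that if $k = 1$, the above theorem reduces to the result proved by Ferreira (1981, 1982).

For any $g \in L$, let

$$I_g = \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big)^t M_g^{-1} \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j}\Big] - B_{ij}(g)\Big)\Big). \tag{2.4.24}$$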

In the definition of $I_g$, $((E[\partial g_i/\partial\theta_j] - B_{ij}(g)))$ is a measure of sensitivity of $g$, and $M_g$ is a measure of variability of $g$. Thus, analogous to the frequentist case, the following definition of an optimal estimating function in the Bayesian framework seems appropriate.

Definition. If $L_0$ is a subspace of $L$, and $g^* \in L_0$, then $g^*$ is called an optimal Bayesian estimating function in $L_0$ if, for any $g \in L_0$,

$$I_g \le I_{g^*}.$$

Next  we  prove  an  optimality  result  about  Bayesian  estimating  functions  in  this 
formulation. 

Theorem 2.4.2. With the same notation as above, let the generalized inner product on $L$ be defined by (2.4.16), and let $L_0$ be a subspace of $L$. If $g^*$ is the orthogonal projection of $s$ into $L_0$, then

(1) $g^*$ is an optimal Bayesian estimating function in $L_0$;

(2) the optimal Bayesian estimating function in $L_0$ is unique in the following sense: for any $g \in L_0$, $I_g = I_{g^*}$ if and only if there exists an invertible $k \times k$ matrix $M$ such that $g^* = M g$.

Proof. (1). From Theorem 2.2.3,

$$\langle g,s\rangle^t \langle g,g\rangle^{-1} \langle g,s\rangle \le \langle g^*,s\rangle^t \langle g^*,g^*\rangle^{-1} \langle g^*,s\rangle,$$

for all $g \in L_0$. But from Lemma 2.4.1,

$$I_g = \langle g,s\rangle^t \langle g,g\rangle^{-1} \langle g,s\rangle,$$

for any $g \in L_0$. Thus the result follows easily.

(2) follows easily from Theorem 2.2.4.

Next we apply Theorem 2.4.2 to a case where $L_0$ is a finite dimensional subspace of $L$, with a linearly independent basis.

Let $\{u_i(X_i,\theta)\}_{i=1}^K$ be a family of $n_i \times 1$ vectors of parametric functions and $v(\theta)$ be an $m \times 1$ vector such that

(1) For fixed $\theta \in \Theta$, $u_i(\cdot,\theta) : \mathcal{X} \to R^{n_i}$ is measurable;

(2) $v : \Theta \to R^m$ is measurable;

(3) $E[u_i \mid \theta] = 0$, and $E[v] = 0$;

(4) Conditional on $\theta$, $\{u_i(X_i,\theta)\}_{i=1}^K$ are independent.

Consider the space of estimating functions of the form

$$L_0 = \Big\{ \sum_{i=1}^K W_i(\theta)\, u_i(X_i,\theta) + Q\, v(\theta) \Big\},$$

where for any $\theta \in \Theta$, $W_i(\theta)$ is a $p \times n_i$ matrix, for all $i \in \{1,\ldots,K\}$, and $Q$ is a $p \times m$ matrix.

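Theorem 2.4.3. With the same notation as above, let

$$W^*_i(\theta) = -E\Big[\frac{\partial u_i}{\partial\theta} \,\Big|\, \theta\Big]^t [\mathrm{Var}(u_i \mid \theta)]^{-1}, \qquad Q^* = E\Big[\frac{\partial\log\pi(\theta)}{\partial\theta}\, v(\theta)^t\Big] \big(E[v(\theta)\, v(\theta)^t]\big)^{-1},$$

and

$$g^* = \sum_{i=1}^K W^*_i(\theta)\, u_i(X_i,\theta) + Q^* v(\theta).$$

Then

(a) $g^*$ is the orthogonal projection of $s$ into $L_0$;

(b) $g^*$ is an optimal Bayesian estimating function in $L_0$;

(c) the optimal Bayesian estimating function in $L_0$ is unique in the following sense: if $g \in L_0$, then $I_g = I_{g^*}$ if and only if there exists an invertible matrix $M$ such that

$$g^*(X_1,\ldots,X_K;\theta) = M\, g(X_1,\ldots,X_K;\theta),$$

with probability 1 with respect to the joint distribution of the $X_i$ and $\theta$.

Proof. (a). For any $g = \sum_{i=1}^K W_i(\theta)\, u_i(X_i,\theta) + Q\, v(\theta)$,

$$\langle s - g^*, g\rangle = \langle s,g\rangle - \langle g^*,g\rangle.$$

But

$$\langle s,g\rangle = \sum_{i=1}^K E\{E[s\, u_i^t \mid \theta]\, W_i(\theta)^t\} + E[s\, v(\theta)^t]\, Q^t,$$

and

$$\begin{aligned}
\langle g^*,g\rangle &= \sum_{i=1}^K E\{W^*_i(\theta)\, E[u_i(X_i,\theta)\, u_i(X_i,\theta)^t \mid \theta]\, W_i(\theta)^t\} + Q^* E[v(\theta)\, v(\theta)^t]\, Q^t \\
&= -\sum_{i=1}^K E\Big\{E\Big[\frac{\partial u_i}{\partial\theta} \,\Big|\, \theta\Big]^t W_i(\theta)^t\Big\} + E\Big[\frac{\partial\log\pi(\theta)}{\partial\theta}\, v(\theta)^t\Big] Q^t.
\end{aligned}$$

By Lemma 2.4.1, $E[s\, u_i^t \mid \theta] = -E[\partial u_i/\partial\theta \mid \theta]^t$ because $E[u_i \mid \theta] = 0$, and $E[s\, v(\theta)^t] = E[(\partial\log\pi(\theta)/\partial\theta)\, v(\theta)^t]$ because $v$ depends on $\theta$ only, so the two expressions agree. Thus by Theorem 2.2.1, $g^*$ is the orthogonal projection of $s$ into $L_0$.

Parts (b) and (c) follow from (a) and Theorem 2.4.2.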

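Next we turn to the formulation of Bayesian estimating functions introduced by Ghosh (1990). In this formulation, the parameter space is assumed to have the form $\Theta = (a_1,b_1) \times \cdots \times (a_k,b_k)$. We start with a result which is very similar to Lemma 2.4.1.

Lemma 2.4.2. Let $\pi(\theta \mid X)$ be the posterior density, let $s_j = \partial\log\pi(\theta \mid X)/\partial\theta_j$, $j = 1,\ldots,k$, and let $g_i : \mathcal{X} \times \Theta \to R$ be a function satisfying suitable regularity conditions. Then

$$E[g_i s_j \mid X] = -E\Big[\frac{\partial g_i}{\partial\theta_j} \,\Big|\, X\Big] + E[B_j(g_i) \mid X],$$

where

$$B_j(g_i) = \lim_{\theta_j \to b_j} g_i(X,\theta)\, \pi(\theta \mid X) - \lim_{\theta_j \to a_j} g_i(X,\theta)\, \pi(\theta \mid X).$$

Proof. Note that, integrating by parts with respect to $\theta_j$,

$$E[g_i s_j \mid X] = \int_\Theta g_i\, \frac{\partial\pi(\theta \mid X)}{\partial\theta_j}\, d\theta = E[B_j(g_i) \mid X] - E\Big[\frac{\partial g_i}{\partial\theta_j} \,\Big|\, X\Big].$$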

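Next the definition of posterior estimating functions is introduced. A function $g : \Theta \times \mathcal{X} \to R^k$ is called a posterior unbiased estimating function (PUEF) if

$$E[g(\theta,X) \mid X] = 0, \tag{2.4.25}$$

$$B_j(g_i) = 0, \quad \forall\, x \in \mathcal{X}, \ i,j \in \{1,\ldots,k\}. \tag{2.4.26}$$

Actually, all we require is that

$$E[B_j(g_i) \mid X] = 0, \quad \forall\, x \in \mathcal{X}, \ i,j \in \{1,\ldots,k\}.$$

Let $L$ be the space consisting of all functions $g : \Theta \times \mathcal{X} \to R^k$ which are PUEF and for which $E[g g^t \mid X]$ is invertible. A family of generalized inner products on $L$ is defined as follows: for any $f,g \in L$, and $x \in \mathcal{X}$,

$$\langle f,g\rangle_x = E[f(\theta,X)\, g(\theta,X)^t \mid X = x]. \tag{2.4.27}$$

If the score function $s \in L$, then from Lemma 2.4.2,

$$\langle g,s\rangle_x = -\Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j} \,\Big|\, X = x\Big]\Big)\Big).$$

Next for every $g \in L$, $x \in \mathcal{X}$, define

$$I_g(x) = \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j} \,\Big|\, X = x\Big]\Big)\Big)^t \big(E[g(\theta,X)\, g(\theta,X)^t \mid X = x]\big)^{-1} \Big(\Big(E\Big[\frac{\partial g_i}{\partial\theta_j} \,\Big|\, X = x\Big]\Big)\Big). \tag{2.4.28}$$

Let $L_0$ be a subspace of $L$; $g^* \in L_0$ is said to be an optimal element in $L_0$ if

$$I_g(x) \le I_{g^*}(x),$$

for all $g \in L_0$, and $x \in \mathcal{X}$.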

The following result now follows very easily.

Theorem 2.4.4. With the same notation as above, suppose that the orthogonal projection $g^*$ of $s$ into $L_0$ exists with respect to the generalized inner products. Then $g^*$ is optimal in $L_0$, that is, for all $g \in L_0$, we have that

$$I_g(x) \le I_{g^*}(x),$$

$\forall\, x \in \mathcal{X}$. Furthermore, the optimal element in $L_0$ is unique in the following sense: if $g \in L_0$, then $I_g(x) = I_{g^*}(x)$, $\forall\, x \in \mathcal{X}$, if and only if there exists an invertible matrix valued function $M : \mathcal{X} \to M_{k\times k}$ such that

$$g(\theta;x) = M(x)\, g^*(\theta;x).$$

Proof. The first part of the theorem is a consequence of Theorem 2.2.3, and the second part is a consequence of Theorem 2.2.4.

Note that if $s \in L_0$, then $s$ is an optimal estimating function.
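As a worked illustration of this posterior formulation (an example added here under the assumption of a normal posterior, not taken from the text), let $k = 1$, $\Theta = (-\infty,\infty)$ and $\pi(\theta \mid x) = N(m(x), v)$, and let $L_0$ be spanned by $s$ and the hypothetical competitor $g(\theta,x) = (\theta - m(x))^3$. Using (2.4.28) and the normal moments $E[(\theta - m)^2 \mid x] = v$ and $E[(\theta - m)^6 \mid x] = 15v^3$:

```latex
\begin{align*}
s(\theta,x) &= -\frac{\theta - m(x)}{v}, &
E\Big[\frac{\partial s}{\partial\theta} \,\Big|\, x\Big] &= -\frac{1}{v}, &
E[s^2 \mid x] &= \frac{1}{v}, &
I_s(x) &= \frac{(1/v)^2}{1/v} = \frac{1}{v}, \\
g(\theta,x) &= (\theta - m(x))^3, &
E\Big[\frac{\partial g}{\partial\theta} \,\Big|\, x\Big] &= 3v, &
E[g^2 \mid x] &= 15v^3, &
I_g(x) &= \frac{(3v)^2}{15v^3} = \frac{3}{5v}.
\end{align*}
```

Both functions are posterior unbiased with vanishing boundary terms, and $I_g(x) = \tfrac{3}{5} I_s(x) < I_s(x)$, in agreement with the optimality of $s$. The factor $3/5$ is the squared posterior correlation between $g$ and $s$.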

As a corollary of Theorem 2.4.4, we have the following generalization of a result due to Godambe (1994) about optimal estimating functions to a multi-dimensional parameter space.

Corollary 2.4.1. If $g^* \in L_0$ is the orthogonal projection of $s$ into $L_0$, then

(a) $I_g(x) \le I_{g^*}(x)$, for all $g \in L_0$ and $x \in \mathcal{X}$;

(b) $E[(g^* - s)(g^* - s)^t \mid x] \le E[(g - s)(g - s)^t \mid x]$, for all $g \in L_0$ and $x \in \mathcal{X}$.

Note that if the parameter space is one dimensional, then (a) is equivalent to $\mathrm{corr}\{g^*, s \mid x\}^2 \ge \mathrm{corr}\{g, s \mid x\}^2$, for all $g \in L_0$ and $x \in \mathcal{X}$. This is the result proved by Godambe (1994).

2.5    Orthogonal Decomposition and Information Inequality

In this section, we give a geometric intuition for some information inequalities. Let us start with one of the main results of this section.

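Theorem 2.5.1. Suppose $L_0$ is a subspace of $L$, and for all $g \in L$, let $g_0$ be the orthogonal projection of $g$ into $L_0$. Also, let $s$ be the score function.

(i) If $\langle g - g_0, s\rangle_\theta = 0$, $\forall\, \theta \in \Theta$, then

$$I_g(\theta) \le I_{g_0}(\theta), \quad \forall\, \theta \in \Theta.$$

(ii) If $\langle g_0, s\rangle_\theta = 0$, $\forall\, \theta \in \Theta$, then

$$I_g(\theta) \le I_{g - g_0}(\theta), \quad \forall\, \theta \in \Theta.$$

Proof. Note that

$$I_g(\theta) = \langle g,s\rangle_\theta^t\, \langle g,g\rangle_\theta^{-1}\, \langle g,s\rangle_\theta, \quad \forall\, \theta \in \Theta.$$

(i) If $\langle g - g_0, s\rangle_\theta = 0$, $\forall\, \theta \in \Theta$, then $\langle g,s\rangle_\theta = \langle g_0,s\rangle_\theta$, $\forall\, \theta \in \Theta$. Also,

$$\langle g,g\rangle_\theta = \langle g_0,g_0\rangle_\theta + \langle g - g_0, g - g_0\rangle_\theta \ge \langle g_0,g_0\rangle_\theta, \quad \forall\, \theta \in \Theta.$$

Thus

$$I_g(\theta) \le I_{g_0}(\theta), \quad \forall\, \theta \in \Theta.$$

(ii) If $\langle g_0, s\rangle_\theta = 0$, $\forall\, \theta \in \Theta$, then $\langle g - g_0, s\rangle_\theta = \langle g,s\rangle_\theta$, $\forall\, \theta \in \Theta$, and

$$\langle g,g\rangle_\theta = \langle g_0,g_0\rangle_\theta + \langle g - g_0, g - g_0\rangle_\theta \ge \langle g - g_0, g - g_0\rangle_\theta, \quad \forall\, \theta \in \Theta.$$

Thus

$$I_g(\theta) \le I_{g - g_0}(\theta), \quad \forall\, \theta \in \Theta.$$

As an application of the previous result, we have the following:

Corollary 2.5.1. If $T$ is a statistic, $\forall\, g \in L$, let $g_0 = E[g(X,\theta) \mid T]$. Then

(1) if $T$ is sufficient, then

$$I_g(\theta) \le I_{g_0}(\theta), \quad \forall\, \theta \in \Theta;$$

(2) if $T$ is ancillary, then

$$I_g(\theta) \le I_{g - g_0}(\theta), \quad \forall\, \theta \in \Theta.$$

Proof. (1). If $T$ is sufficient, using the factorization theorem, we have that

$$\langle g - g_0, s\rangle_\theta = 0, \quad \forall\, \theta \in \Theta,$$

since the score function is only a function of $T$. The result follows now from part (i) of the previous theorem.

(2). If $T$ is ancillary, then $\forall\, \theta \in \Theta$,

$$\langle g_0,s\rangle_\theta = E\Big[g_0 \Big(\frac{\partial\log f_X}{\partial\theta}\Big)^t \,\Big|\, \theta\Big] = E\Big\{g_0\, E\Big[\Big(\frac{\partial\log f_{X\mid T}}{\partial\theta}\Big)^t \,\Big|\, \theta, T\Big] \,\Big|\, \theta\Big\} = 0,$$

where $f_X$ and $f_{X\mid T}$ are the marginal and conditional pdf's of $X$ respectively, since the conditional score function has zero expectation with respect to the conditional density.

Thus both results follow from the previous theorem easily.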

Next we study the information decomposition for information unbiased estimating functions. Let us start with the definition introduced by Lindsay (1982). Let $g$ be an estimating function. Then $g$ is called information unbiased if

$$\langle g,s\rangle_\theta = \langle g,g\rangle_\theta, \quad \forall\, \theta \in \Theta,$$

i.e., $g$ and $s - g$ are orthogonal, where $s$ is the score function.

The main result on information unbiased estimating functions is given in the following theorem.

Theorem 2.5.2. Let $T$ be a statistic such that the marginal and conditional densities of $T$ satisfy the usual regularity conditions. For all $g \in L$, let $g_0 = E[g \mid T,\theta]$, $h = g - E[g \mid T,\theta]$. Suppose that $g$ is information unbiased. Then

(1) if $h$ is information unbiased with respect to $f_{X\mid T}$, then $g_0$ is information unbiased with respect to $f_T$;

(2) if $g_0$ is information unbiased with respect to $f_T$, then $h$ is information unbiased with respect to $f_{X\mid T}$;

(3) if at least one of (1) or (2) holds, then

$$I_g(\theta) = I_{g_0}(\theta) + I_h(\theta), \quad \forall\, \theta \in \Theta.$$
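Proof. Let $s_X$ denote the score function of $X$. Then we have that

$$s_X = s_T + s_{X\mid T},$$

and

$$\begin{aligned}
\langle g, s_X - g\rangle_\theta &= E[g\,(s_X - g)^t \mid \theta] \\
&= E[(g_0 + h)(s_T - g_0 + s_{X\mid T} - h)^t \mid \theta] \\
&= \langle g_0, s_T - g_0\rangle_\theta + \langle h, s_{X\mid T} - h\rangle_\theta + E[g_0\,(s_{X\mid T} - h)^t \mid \theta] + E[h\,(s_T - g_0)^t \mid \theta].
\end{aligned}$$

But

$$E[g_0\,(s_{X\mid T} - h)^t \mid \theta] = E\{g_0\, E[(s_{X\mid T} - h)^t \mid T,\theta] \mid \theta\} = 0,$$

since $E[h \mid T,\theta] = 0$ and $E[s_{X\mid T} \mid T,\theta] = 0$, $\forall\, \theta \in \Theta$, and

$$E[h\,(s_T - g_0)^t \mid \theta] = E\{E[h \mid T,\theta]\,(s_T - g_0)^t \mid \theta\} = 0,$$

since $E[h \mid T,\theta] = 0$, $\forall\, \theta \in \Theta$. So

$$\langle g, s_X - g\rangle_\theta = \langle g_0, s_T - g_0\rangle_\theta + \langle h, s_{X\mid T} - h\rangle_\theta, \quad \forall\, \theta \in \Theta. \tag{2.5.29}$$

(1) and (2) follow from the above equality.

(3). First note that if $g$ is information unbiased, then

$$I_g(\theta) = \langle g,g\rangle_\theta, \quad \forall\, \theta \in \Theta.$$

Also, because $g = g_0 + h$, and $g_0$, $h$ are orthogonal,

$$\langle g,g\rangle_\theta = \langle g_0,g_0\rangle_\theta + \langle h,h\rangle_\theta, \quad \forall\, \theta \in \Theta.$$

This implies that

$$I_g(\theta) = I_{g_0}(\theta) + I_h(\theta), \quad \forall\, \theta \in \Theta.$$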

As  an  easy  consequence  of  the  above  theorem,  we  have  the  following  result  proved 
by  Bhapkar  (1989,  1991a). 

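Corollary 2.5.2. With the same notation as the above theorem,

$$I_{s_X}(\theta) = I_{s_T}(\theta) + I_{s_{X\mid T}}(\theta), \quad \forall\, \theta \in \Theta.$$

Proof. The result follows by noting that under the usual regularity conditions, $s_T$ and $s_{X\mid T}$ are information unbiased.

Note that if $T$ is sufficient, then

$$I_{s_X}(\theta) = I_{s_T}(\theta), \quad \forall\, \theta \in \Theta.$$

If $T$ is ancillary, then

$$I_{s_X}(\theta) = I_{s_{X\mid T}}(\theta), \quad \forall\, \theta \in \Theta.$$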
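A small numerical check of Corollary 2.5.2 may help fix ideas. The sketch below is a toy example built on assumptions that are not in the text: $X = (X_1, X_2)$ is a pair of binary variables with $P(X_1 = 1) = \theta$, the conditional probability $P(X_2 = 1 \mid X_1)$ is an arbitrary smooth function of $\theta$, and $T = X_1$. Scores are approximated by central finite differences of the log pmf, and each information is computed as $\langle s,s\rangle_\theta$ because the scores are information unbiased.

```python
import numpy as np

def pmf_T(t, theta):
    """Marginal pmf of T = X1, a Bernoulli(theta) variable."""
    return theta if t == 1 else 1.0 - theta

def pmf_X_given_T(x2, t, theta):
    """Conditional pmf of X2 given T = t (a hypothetical smooth choice)."""
    p1 = theta**2 if t == 1 else 1.0 - theta / 2.0
    return p1 if x2 == 1 else 1.0 - p1

def score(log_pmf, theta, eps=1e-5):
    """Central finite-difference derivative of a log pmf with respect to theta."""
    return (log_pmf(theta + eps) - log_pmf(theta - eps)) / (2.0 * eps)

theta = 0.4
info_X = info_T = info_X_given_T = 0.0
for t in (0, 1):
    for x2 in (0, 1):
        prob = pmf_T(t, theta) * pmf_X_given_T(x2, t, theta)
        s_T = score(lambda th: np.log(pmf_T(t, th)), theta)
        s_XT = score(lambda th: np.log(pmf_X_given_T(x2, t, th)), theta)
        s_X = s_T + s_XT                   # decomposition s_X = s_T + s_{X|T}
        info_X += prob * s_X**2            # I_{s_X} = <s_X, s_X>
        info_T += prob * s_T**2            # I_{s_T} = <s_T, s_T>
        info_X_given_T += prob * s_XT**2   # I_{s_{X|T}} = <s_{X|T}, s_{X|T}>
```

Up to finite-difference error, `info_X` equals `info_T + info_X_given_T`, because the cross term $E[s_T\, s_{X\mid T} \mid \theta]$ vanishes after conditioning on $T$.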

2.6    Orthogonal  Decomposition  for  Estimating  Functions 

In  this  section,  we  prove  a  Hoeffding  type  decomposition  for  estimating  functions, 
revealing  the  geometric  nature  of  the  Hoeffding  decomposition  for  U  statistics. 

Let $X_1,\ldots,X_n$ be independent random variables, $\mathcal{X}_i$ be the sample space for $X_i$, $i = 1,\ldots,n$, and $\Theta$ be the parameter space. A function

$$h : \mathcal{X}_1 \times \cdots \times \mathcal{X}_n \times \Theta \to R^k$$

is called an unbiased estimating function if

$$E[h(X_1,\ldots,X_n;\theta) \mid \theta] = 0, \quad \forall\, \theta \in \Theta.$$

Let $\mathcal{E}$ consist of all unbiased estimating functions $h : \mathcal{X}_1 \times \cdots \times \mathcal{X}_n \times \Theta \to R^k$ such that

$$E[h(X_1,\ldots,X_n;\theta)\, h(X_1,\ldots,X_n;\theta)^t \mid \theta]$$

is well defined for all $\theta \in \Theta$.

For $h_1, h_2 \in \mathcal{E}$, define a family of generalized inner products of $h_1, h_2$ as

$$\langle h_1,h_2\rangle_\theta = E[h_1 h_2^t \mid \theta].$$

Then $(\mathcal{E}, \langle\cdot,\cdot\rangle_\theta)$ is a generalized inner product space for all $\theta \in \Theta$. For $m \le n$, let $\mathcal{E}_m$ be the linear span of the functions of the form

$$h : \mathcal{X}_{i_1} \times \cdots \times \mathcal{X}_{i_m} \times \Theta \to R^k,$$

which satisfy

$$E[h(X_{i_1},\ldots,X_{i_m};\theta) \mid \theta] = 0,$$

and for which $\langle h,h\rangle_\theta$ is well defined for all $\theta \in \Theta$, where $\{i_1,\ldots,i_m\} \subset \{1,\ldots,n\}$.

If $m_1 \le m_2 \le n$, then $\mathcal{E}_{m_1}$ can be regarded as a subspace of $\mathcal{E}_{m_2}$ in the obvious fashion. Now a natural question is: $\forall\, h \in \mathcal{E}$, $m \le n$, does the orthogonal projection of $h$ into $\mathcal{E}_m$ exist? If it does, how do we find it? The answer to the above question is affirmative, and it turns out that the answer is very closely related to the Hoeffding type decomposition for U statistics. Before proving the main result of this section, let us introduce some further notation to simplify our presentation. For all $I = \{i_1,\ldots,i_m\} \subset \{1,\ldots,n\}$, where $i_1 < \cdots < i_m$, let

$$f(X_I) = f(X_{i_1},\ldots,X_{i_m}), \qquad E[g \mid X_I] = E[g \mid X_{i_1},\ldots,X_{i_m}],$$

$$T_k = \{(i_1,\ldots,i_k) : 1 \le i_1 < \cdots < i_k \le n\}, \quad \forall\, 1 \le k \le n.$$

Now we return to the orthogonal decomposition of estimating functions. Let $h \in \mathcal{E}$; $\forall\, I = \{i_1,\ldots,i_m\} \subset \{1,\ldots,n\}$, $1 \le m \le n$, let

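$$h_I(X_I) = E[h \mid X_I] - \sum_{\emptyset \ne J \subset I,\ J \ne I} h_J(X_J), \tag{2.6.30}$$

and, $\forall\, 1 \le k \le n$, let

$$h_k = \sum_{I \in T_k} h_I(X_I). \tag{2.6.31}$$

Then we have the following result.

Theorem 2.6.1. Let $\mathcal{E}$ and $\mathcal{E}_m$ be the generalized inner product spaces defined as above, where $m \le n$. Then $\forall\, h \in \mathcal{E}$, the orthogonal projection of $h$ into $\mathcal{E}_m$ exists, and is given by

$$\sum_{i=1}^m h_i,$$

where $h_i$ $(1 \le i \le m)$ are defined as above.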

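Proof. We are only going to show that $h_1$ and $h_1 + h_2$ are the orthogonal projections of $h$ into $\mathcal{E}_1$ and $\mathcal{E}_2$ respectively. The rest can be proved similarly.

(1). $\sum_{i=1}^n E[h \mid X_i]$ is the orthogonal projection of $h$ into $\mathcal{E}_1$. In fact, $\forall\, \sum_{j=1}^n g_j(X_j) \in \mathcal{E}_1$, we have

$$\begin{aligned}
\Big\langle h - \sum_{i=1}^n E[h \mid X_i], \sum_{j=1}^n g_j(X_j)\Big\rangle_\theta
&= \sum_{j=1}^n \langle h, g_j(X_j)\rangle_\theta - \sum_{i=1}^n \sum_{j=1}^n \langle E[h \mid X_i], g_j(X_j)\rangle_\theta \\
&= \sum_{j=1}^n \langle E[h \mid X_j], g_j(X_j)\rangle_\theta - \sum_{j=1}^n \langle E[h \mid X_j], g_j(X_j)\rangle_\theta \\
&= 0,
\end{aligned}$$

since $\{X_j\}_{j=1}^n$ are independent. So by Theorem 2.2.1, we know that $\sum_{i=1}^n E[h \mid X_i]$ is the orthogonal projection of $h$ into $\mathcal{E}_1$.

(2). Next we show that

$$\sum_{i_1 < i_2} \{E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}]\} + \sum_{i=1}^n E[h \mid X_i]$$

is the orthogonal projection of $h$ into $\mathcal{E}_2$. In fact, $\forall\, \sum_{j_1 < j_2} g_{j_1 j_2}(X_{j_1}, X_{j_2}) \in \mathcal{E}_2$,

$$\begin{aligned}
&\Big\langle h - h_2 - h_1, \sum_{j_1 < j_2} g_{j_1 j_2}(X_{j_1}, X_{j_2})\Big\rangle_\theta \\
&\quad = \sum_{j_1 < j_2} \langle h, g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta \\
&\qquad - \sum_{i_1 < i_2} \sum_{j_1 < j_2} \langle E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}],\ g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta \\
&\qquad - \sum_{j_1 < j_2} \sum_{i=1}^n \langle E[h \mid X_i], g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta.
\end{aligned} \tag{2.6.32}$$

Note that if $i_1 \ne j_1$ and $i_2 \ne j_2$, then

$$\langle E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}],\ g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta = 0,$$

by the independence of $\{X_j\}_{j=1}^n$. If $i_1 = j_1$ and $i_2 \ne j_2$, or $i_1 \ne j_1$ and $i_2 = j_2$, then

$$\langle E[h \mid X_{i_1}, X_{i_2}] - E[h \mid X_{i_1}] - E[h \mid X_{i_2}],\ g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta = 0.$$

Also if $i \ne j_1$ and $i \ne j_2$, then

$$\langle E[h \mid X_i], g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta = 0.$$

Thus from the above equation, we get that

$$\begin{aligned}
&\Big\langle h - h_2 - h_1, \sum_{j_1 < j_2} g_{j_1 j_2}(X_{j_1}, X_{j_2})\Big\rangle_\theta \\
&\quad = \sum_{j_1 < j_2} \langle E[h \mid X_{j_1}] + E[h \mid X_{j_2}],\ g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta - \sum_{j_1 < j_2} \sum_{i \in \{j_1, j_2\}} \langle E[h \mid X_i], g_{j_1 j_2}(X_{j_1}, X_{j_2})\rangle_\theta \\
&\quad = 0.
\end{aligned}$$

Hence $h_2 + h_1$ is the orthogonal projection of $h$ into $\mathcal{E}_2$.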

As a consequence of the above theorem, we have the following result.

Corollary 2.6.1. With the same notation as above, $\{h_1,\ldots,h_n\}$ are orthogonal to each other.

Proof. For all $1 \le m_1 < m_2 \le n$, since $\sum_{i=1}^{m_2-1} h_i$ and $\sum_{i=1}^{m_2} h_i$ are the orthogonal projections of $h$ into $\mathcal{E}_{m_2-1}$ and $\mathcal{E}_{m_2}$ respectively,

$$h_{m_2} = \sum_{i=1}^{m_2} h_i - \sum_{i=1}^{m_2-1} h_i$$

is orthogonal to $\mathcal{E}_{m_2-1}$. Hence $h_{m_2}$ is orthogonal to $\{h_1,\ldots,h_{m_2-1}\}$, since

$$\{h_1,\ldots,h_{m_2-1}\} \subset \mathcal{E}_{m_2-1}.$$

Theorem 2.6.1 generalizes the ANOVA decomposition for statistics proved by Efron and Stein (1981) to the estimating function case.

Also, as another consequence of the orthogonal decomposition theorem, we have the following variance decomposition result.

Corollary 2.6.2. With the same notation as above, we have that

$$\mathrm{Var}_\theta(h) = \sum_{i=1}^n \mathrm{Var}_\theta(h_i).$$

CHAPTER  3 
THE  GEOMETRY  OF  ESTIMATING  FUNCTIONS  II 

3.1  Introduction 

In Chapter 2, the notion of generalized inner product spaces was introduced to study optimal estimating functions without nuisance parameters. It was shown that the orthogonal projection of the score function into a linear subspace of estimating functions is optimal in that subspace, and a general method for the construction of such orthogonal projections was also given. As applications, both frequentist and Bayesian optimal estimating functions were found, including as special cases some of the frequentist and Bayesian results derived earlier.

In this chapter, we extend the results of the previous chapter to find optimal estimating functions in the presence of nuisance parameters. First, in Section 3.2, we derive some simple extensions of the basic geometric results of Chapter 2. Next, in Section 3.3, we derive a general result on globally optimal estimating functions which extends the results of Godambe and Thompson (1974) and Godambe (1976) to the multiparameter case. The general result is also used to study the geometry of conditional and marginal inference, including as special cases some of the results of Bhapkar (1989, 1991).

In Section 3.4, a general result on locally optimal estimating functions is found, and is used to generalize (1) Lindsay's (1982) result on the local optimality of conditional score functions, (2) Godambe's (1985) result on estimation for stochastic processes, and (3) Murphy and Li's (1995) result on projected partial likelihood.

Finally, in Section 3.5, we derive optimal conditional estimating functions. As an application, we generalize the results of Godambe and Thompson (1989) in the presence of nuisance parameters.

3.2    Properties  of  Orthogonal  Projections 

The  following  result  demonstrates  that  the  operation  of  orthogonal  projection  is 
compatible  with  linear  operations  in  a  generalized  inner  product  space. 

Proposition 3.2.1. Let $(L, \langle \cdot,\cdot \rangle)$ be a generalized inner product space, and
let $L_0$ be a subspace of $L$. If $s_1, s_2 \in L$, and the orthogonal projection $g_i$ of $s_i$ into $L_0$
exists for $i = 1, 2$, then

(i) $g_1 + g_2$ is the orthogonal projection of $s_1 + s_2$ into $L_0$;

(ii) for any matrix $N$, $N g_i$ is the orthogonal projection of $N s_i$ into $N L_0$.

Proof. From Theorem 2.2.1, $g^*$ is the orthogonal projection of $s$ into $L_0$ if and
only if

\[ \langle s - g^*, g \rangle = 0, \]

for all $g \in L_0$.

(i) For any $g \in L_0$,

\[ \langle s_1 + s_2 - g_1 - g_2, g \rangle = \langle s_1 - g_1, g \rangle + \langle s_2 - g_2, g \rangle = 0; \]

so $g_1 + g_2$ is the orthogonal projection of $s_1 + s_2$ into $L_0$.

(ii) Any $g \in N L_0$ can be written as $g = N g_0$ with $g_0 \in L_0$, so

\[ \langle N s_i - N g_i, g \rangle = \langle N (s_i - g_i), N g_0 \rangle = N \langle s_i - g_i, g_0 \rangle N^t = 0; \]

so $N g_i$ is the orthogonal projection of $N s_i$ into $N L_0$.
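The two parts of Proposition 3.2.1 can be checked numerically on a small finite sample space. The sketch below is only an illustration under assumptions that are not in the text: the sample space has six points with randomly drawn probabilities `p`, the generalized inner product is taken to be $E[f g^t]$, and $L_0 = \{A h\}$ for one fixed zero-mean function `h`. For that choice the projection has the closed form $E[s h^t] E[h h^t]^{-1} h$, and $N L_0 = L_0$ for invertible $N$. All names (`center`, `inner`, `project`) are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 2                        # sample points, dimension of the functions
p = rng.dirichlet(np.ones(m))      # probabilities on the finite sample space

def center(f):
    """Subtract the p-weighted mean so that E[f] = 0."""
    return f - p @ f

def inner(f, g):
    """Generalized inner product <f, g> = E[f g^t]; row x of f holds f(x)^t."""
    return (f * p[:, None]).T @ g

def project(s, h):
    """Orthogonal projection of s into L0 = {A h : A a d x d matrix}."""
    A = inner(s, h) @ np.linalg.inv(inner(h, h))
    return h @ A.T

h = center(rng.normal(size=(m, d)))
s1 = center(rng.normal(size=(m, d)))
s2 = center(rng.normal(size=(m, d)))
N = np.array([[2.0, 1.0], [0.0, 3.0]])   # a fixed invertible matrix

g1, g2 = project(s1, h), project(s2, h)

# Part (i): the projection of s1 + s2 is g1 + g2, and the residual is orthogonal to h.
sum_ok = np.allclose(project(s1 + s2, h), g1 + g2)
orth_ok = np.allclose(inner(s1 + s2 - g1 - g2, h), 0.0)

# Part (ii): the projection of N s1 is N g1 (rows transform as s -> s N^t).
scale_ok = np.allclose(project(s1 @ N.T, h), g1 @ N.T)
print(sum_ok, orth_ok, scale_ok)
```

Orthogonality to `h` is enough here because $\langle r, A h \rangle = \langle r, h \rangle A^t$ for every matrix $A$.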

The following result is a slight generalization of Theorem 2.2.3.

Theorem 3.2.1. Let $(L, \langle \cdot,\cdot \rangle)$ be a generalized inner product space, and let
$L_0$ be a subspace of $L$. For any fixed $s \in L$, and any invertible matrix $N$, suppose
that the orthogonal projection $g^*$ of $N s$ into $L_0$ exists. Consider the function

\[ I_g = \langle g, s \rangle^t \langle g, g \rangle^{-1} \langle g, s \rangle . \]

Then

\[ I_{g^*} - I_g \]

is non negative definite, for all $g \in L_0$.

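Proof. Since $g^*$ is the orthogonal projection of $N s$ into $L_0$ and $N$ is invertible,
$N^{-1} g^*$ is the orthogonal projection of $s$ into $N^{-1} L_0$, from part (ii) of Proposition 3.2.1.
Hence, from part (b) of Theorem 2.2.3,

\[ I_{N^{-1} g^*} - I_{N^{-1} g} \tag{3.2.1} \]

is non negative definite for all $g \in L_0$. But for any $g$,

\[ I_{N^{-1} g} = \langle N^{-1} g, s \rangle^t \langle N^{-1} g, N^{-1} g \rangle^{-1} \langle N^{-1} g, s \rangle \]
\[ = \langle g, s \rangle^t (N^t)^{-1} N^t \langle g, g \rangle^{-1} N N^{-1} \langle g, s \rangle \]
\[ = \langle g, s \rangle^t \langle g, g \rangle^{-1} \langle g, s \rangle = I_g. \]

Hence, the result follows from (3.2.1).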
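The invariance $I_{Ng} = I_g$ and the resulting optimality can also be illustrated numerically. The following sketch rests on assumptions that are not part of the text: a finite sample space with eight points, the inner product $E[f g^t]$, and $L_0 = \{A h : A \text{ a } d \times q \text{ matrix}\}$ for a fixed zero-mean `h` with $q > d$ components, so that different elements of $L_0$ really do carry different information. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, q = 8, 2, 3
p = rng.dirichlet(np.ones(m))            # probabilities on the sample space

def center(f):
    return f - p @ f                     # enforce E[f] = 0

def inner(f, g):
    return (f * p[:, None]).T @ g        # <f, g> = E[f g^t]

def info(g, s):
    """I_g = <g, s>^t <g, g>^{-1} <g, s>."""
    gs = inner(g, s)
    return gs.T @ np.linalg.inv(inner(g, g)) @ gs

s = center(rng.normal(size=(m, d)))      # plays the role of the score
h = center(rng.normal(size=(m, q)))      # basis function spanning L0
N = np.array([[2.0, 1.0], [0.0, 3.0]])   # fixed invertible matrix

# g* = orthogonal projection of N s into L0 = {A h}.
A_star = inner(s @ N.T, h) @ np.linalg.inv(inner(h, h))
g_star = h @ A_star.T

# An arbitrary competitor g = B h in L0.
B = rng.normal(size=(d, q))
g = h @ B.T

gap = info(g_star, s) - info(g, s)
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()
invariant = np.allclose(info(g @ N.T, s), info(g, s))
print(min_eig, invariant)
```

The smallest eigenvalue of the gap is non negative up to rounding, in line with Theorem 3.2.1, and multiplying `g` by `N` leaves its information unchanged.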

3.3    Global Optimality of Estimating Functions

In  this  section,  a  general  result  about  global  optimality  of  estimating  functions 
in  the  presence  of  nuisance  parameters  will  be  proved.  As  easy  consequences  of  this 
result,  some  results  of  Godambe  and  Thompson  (1974),  and  Godambe  (1976)  are 
found. Further, the geometry of conditional and marginal inferences will be explored.

3.3.1    The General Result

Suppose $\mathcal{X}$ is a sample space, $\Theta = \Theta_1 \times \Theta_2$ is the parameter space, with
$\Theta_i \subset R^{d_i}$ $(i = 1, 2)$. Let $\theta = (\theta_1, \theta_2)$, where $\theta_1 (\in \Theta_1)$ is the parameter of interest, and
$\theta_2 (\in \Theta_2)$ is the nuisance parameter. Consider the function

\[ g : \mathcal{X} \times \Theta_1 \to R^{d_1}, \]

where $g$ satisfies the following conditions:

(I) $E[g | \theta] = 0$, for all $\theta \in \Theta$;

(II) for almost all $x$, $\partial g / \partial \theta_1$ exists, for all $\theta \in \Theta$;

(III) $\int g \, p_\theta \, d\mu$ is differentiable with respect to $\theta_1$, and differentiation can be taken
under the integral sign;

(IV) $E[\partial g / \partial \theta_1 | \theta]$ is invertible.

The functions which satisfy conditions (I) - (IV) are called regular estimating
functions with respect to $\theta_1$.

Let $L$ denote the space of all regular unbiased estimating functions. For $g_1, g_2 \in L$,
define the family of generalized inner products of $g_1, g_2$ as

\[ \langle g_1, g_2 \rangle_\theta = E[g_1(X, \theta) \, g_2(X, \theta)^t | \theta], \qquad \forall \, \theta \in \Theta. \tag{3.3.2} \]

Also we shall denote by $s$ the score function of a parametric family of distributions
with respect to $\theta_1$. We assume also that the score vector is regular in the sense
described in (I) to (IV).

Definition 3.3.1. Let $(L, \langle \cdot,\cdot \rangle_\theta)$ be the family of generalized inner product
spaces, and let $L_0$ be a subspace of $L$. For any $g \in L_0$, let

\[ I_g(\theta) = E\left[\frac{\partial g}{\partial \theta_1} \Big| \theta\right]^t \langle g, g \rangle_\theta^{-1} \, E\left[\frac{\partial g}{\partial \theta_1} \Big| \theta\right]. \tag{3.3.3} \]

An element $g^* \in L_0$ is said to be an optimal estimating function in $L_0$ if

\[ I_{g^*}(\theta) - I_g(\theta) \]

is n. n. d., for all $g \in L_0$ and $\theta \in \Theta$.

In the rest of this section, unless otherwise stated, we shall assume the following
regularity condition for estimating functions, which basically involves the interchange
of differentiation and expectation.

(R). For any $g \in L$,

\[ E\left[\frac{\partial g}{\partial \theta_1} \Big| \theta\right] = -E[g \, s^t | \theta]. \tag{3.3.4} \]

Thus, combining (3.3.3) and (3.3.4), $I_g(\theta)$ is the information matrix of $g$ with
respect to $s$.

We now state the main result of this section, which is an immediate consequence
of Theorem 2.2.3.

Theorem 3.3.1. Let $s = \partial \log p_\theta / \partial \theta_1$, $M(\theta)$ be an invertible matrix valued function,
and $L_0$ a subspace of $L$. If the orthogonal projection $g^*$ of $M(\theta) s$ into $L_0$ exists, then
$g^*$ is optimal in $L_0$.

As an easy consequence of Theorem 3.3.1, the following result generalizes the
results due to Godambe (1976), Godambe and Thompson (1974) to the multiparameter
case.

Corollary 3.3.1. Suppose there exists $\tilde{g} : \mathcal{X} \times \Theta \to R^{d_1}$ such that

\[ g^* = M(\theta) s + \tilde{g} \in L_0, \tag{3.3.5} \]

and $\tilde{g}$ is orthogonal to every element in $L_0$, then $g^*$ is optimal in $L_0$.

Proof. Since $\tilde{g}$ is orthogonal to every element in $L_0$, and $g^* \in L_0$, $g^*$ is the
orthogonal projection of $M(\theta) s$ into $L_0$. The optimality of $g^*$ in $L_0$ is immediate
from Theorem 3.3.1.

Note that the above corollary generalizes the main result in Godambe and Thompson
(1974) to the multiparameter case. It also provides a geometric explanation of
equation (5) in Godambe and Thompson (1974). To see this, suppose there exist
matrices $\{M(\theta), M_j(\theta)\}_{j=1}^{k}$ of appropriate dimensions such that $M(\theta)$ is invertible
and 

Then  g*  is  the  orthogonal  projection  of  M(9)  s  into  L0. 

For $d_1 = d_2 = 1$ and $k = 2$, the above result reduces to the main result in Godambe
and Thompson (1974).

Next  we  use  Theorem  3.3.1  to  study  the  geometric  ideas  behind  conditional  and 
marginal  inferences. 

3.3.2    Geometry  of  Conditional  Inferences 


In this subsection, we study the geometry of conditional inference. As a
consequence of this geometric approach, some of the results due to Bhapkar (1989,
1991) follow easily.

First use the identity

\[ E\left[\frac{\partial g_1}{\partial \theta_1} \Big| \theta\right] = - \langle g_1, s_{\theta_1} \rangle_\theta, \]

where $s_{\theta_1} = \partial \log p_\theta / \partial \theta_1$; note that $s_{\theta_1}$ may involve both $\theta_1$ and $\theta_2$. The information of $g_1$ is then

\[ I_{g_1}(\theta_1, \theta_2) = \langle g_1, s_{\theta_1} \rangle_\theta^t \, \langle g_1, g_1 \rangle_\theta^{-1} \, \langle g_1, s_{\theta_1} \rangle_\theta . \]

Let $L$ denote the space of all estimating functions which satisfy conditions (I) -
(IV) of Section 3.3.1. Let $L_0$ be a subspace of $L$.

Following Bhapkar (1989, 1991a), suppose the statistic $(S, U)$ is jointly sufficient for
the family $\{p_\theta : \theta \in \Theta\}$, and furthermore, suppose $U$ satisfies the following condition
C:

(C): The conditional distribution of $S$, given $u = U(X)$, depends on $\theta$ only
through $\theta_1$, for almost all $u$; that is, $U$ is sufficient for the nuisance parameter $\theta_2$
when $\theta_1$ is fixed.

Denote by $h(s; \theta_1 | u)$ the conditional pdf of $S = S(X)$, given $u$.

Definition 3.3.2. A statistic $U = U(X)$ is said to be partially ancillary for $\theta_1$ in
the complete sense if

(i) $U$ satisfies requirement C;

(ii) the family $\{p^U_\theta : \theta_2 \in \Theta_2\}$ of distributions of $U$ for fixed $\theta_1$ is complete for
every $\theta_1 \in \Theta_1$.

A statistic $U = U(X)$ is said to be partially ancillary for $\theta_1$ in the weak sense if

(i) $U$ satisfies requirement C;

(ii) the marginal distribution of $U$ depends on $\theta$ only through a parametric
function $\delta = \delta(\theta)$ ($\delta$ is assumed to be differentiable) such that $(\theta_1, \delta)$ is a one-to-one
function of $\theta$.

Letting $p^U_\theta$ be the pdf of $U$ and

\[ l_c(x; \theta_1) = \frac{\partial \log h(s; \theta_1 | u)}{\partial \theta_1}, \]

the following theorem connects Theorem 3.3.1 and Bhapkar's (1989, 1991a) results.
Take $L_0 = L$.

Theorem 3.3.2. If the statistic $U = U(X)$ is partially ancillary for $\theta_1$ in the
complete sense or in the weak sense, then $l_c(x; \theta_1)$ is the orthogonal projection of $s_{\theta_1}$
into $L$.

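Proof. We are going to prove this result in two cases.
Case 1. $U$ is partially ancillary for $\theta_1$ in the complete sense.
Note that

\[ l_c(x; \theta_1) = \frac{\partial \log h(s; \theta_1 | u)}{\partial \theta_1}, \qquad s_{\theta_1} = l_c(x; \theta_1) + \frac{\partial \log p^U_\theta}{\partial \theta_1}. \]

We only need to show that $\forall \, g_1 \in L$, $\theta \in \Theta$,

\[ \left\langle g_1, \frac{\partial \log p^U_\theta}{\partial \theta_1} \right\rangle_\theta = 0. \]

But

\[ \left\langle g_1, \frac{\partial \log p^U_\theta}{\partial \theta_1} \right\rangle_\theta
 = E\left[ g_1 \left( \frac{\partial \log p^U_\theta}{\partial \theta_1} \right)^t \Big| \theta \right] \]
\[ = E\left\{ E\left[ g_1 \left( \frac{\partial \log p^U_\theta}{\partial \theta_1} \right)^t \Big| \theta, U \right] \Big| \theta \right\} \]
\[ = E\left\{ E[g_1 | \theta, U] \left( \frac{\partial \log p^U_\theta}{\partial \theta_1} \right)^t \Big| \theta \right\}. \]

Since $E\{E[g_1 | \theta, U] | \theta\} = 0$, by the completeness of $\{p^U_\theta : \theta_2 \in \Theta_2\}$ for fixed $\theta_1 \in \Theta_1$,
$E[g_1 | \theta, U] = 0$ almost everywhere for fixed $\theta_1$. This implies that

\[ \left\langle g_1, \frac{\partial \log p^U_\theta}{\partial \theta_1} \right\rangle_\theta = 0, \qquad \forall \, \theta \in \Theta. \]

Thus $l_c(x; \theta_1)$ is the orthogonal projection of $s_{\theta_1}$ into $L$.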
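Case 2. $U$ is partially ancillary for $\theta_1$ in the weak sense.
Again note that

\[ s_{\theta_1} = l_c(x; \theta_1) + \frac{\partial \log p^U_\theta}{\partial \theta_1}. \]

Now since the map $(\theta_1, \theta_2) \to (\theta_1, \delta(\theta))$ is a one-to-one map, the matrix

\[ \begin{pmatrix} I_{d_1} & \partial \delta / \partial \theta_1 \\ 0 & \partial \delta / \partial \theta_2 \end{pmatrix} \]

is invertible. It implies that $\partial \delta / \partial \theta_2$ is invertible. Here $\partial \delta / \partial \theta_i$ denotes the
$d_i \times d_2$ matrix whose $(a, b)$ entry is $\partial \delta_b / \partial \theta_{ia}$. Note that, by the chain rule,

\[ \frac{\partial \log p^U_\theta}{\partial \theta_1}
 = \frac{\partial \delta}{\partial \theta_1} \left( \frac{\partial \delta}{\partial \theta_2} \right)^{-1} \frac{\partial \log p^U_\theta}{\partial \theta_2}
 = D(\theta) \, \frac{\partial \log p^U_\theta}{\partial \theta_2}, \]

where $D(\theta) = (\partial \delta / \partial \theta_1)(\partial \delta / \partial \theta_2)^{-1}$. Thus we only need to show that $\forall \, g_1 \in L$, $\theta \in \Theta$,

\[ \left\langle g_1, \frac{\partial \log p^U_\theta}{\partial \theta_2} \right\rangle_\theta = 0. \]

But this follows easily by differentiating

\[ E[g_1 | \theta] = 0 \]

with respect to $\theta_2$, and using the regularity conditions.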
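A standard two-sample Poisson model gives a concrete instance of Theorem 3.3.2. The example is not taken from the text and is meant only as a sketch: assume $X_1 \sim \mathrm{Poisson}(\lambda)$ and $X_2 \sim \mathrm{Poisson}(\psi\lambda)$ are independent, with $\psi$ the parameter of interest and $\lambda$ the nuisance parameter. Then $U = X_1 + X_2$ is partially ancillary for $\psi$, the conditional law of $S = X_2$ given $U = u$ is binomial with success probability $\psi/(1+\psi)$, and the conditional score is $l_c = x_2/\psi - (x_1 + x_2)/(1+\psi)$. The code checks, by truncated summation, that an unbiased estimating function `g` (a hypothetical choice that depends on the data and $\psi$ only) is orthogonal to $s_\psi - l_c$ and carries no more information than $l_c$.

```python
from math import exp, lgamma, log

lam, psi = 2.0, 1.5      # nuisance and interest parameters (illustrative values)
K = 80                   # truncation point; the Poisson tails beyond it are negligible

def pois(k, mu):
    """Poisson probability mass function."""
    return exp(k * log(mu) - mu - lgamma(k + 1))

def expect(f):
    """E[f(X1, X2)] for independent X1 ~ Poisson(lam), X2 ~ Poisson(psi * lam)."""
    return sum(pois(x1, lam) * pois(x2, lam * psi) * f(x1, x2)
               for x1 in range(K) for x2 in range(K))

def score(x1, x2):
    """Full score with respect to psi."""
    return x2 / psi - lam

def cond_score(x1, x2):
    """Conditional score of X2 given U = X1 + X2 (binomial likelihood)."""
    return x2 / psi - (x1 + x2) / (1 + psi)

def g(x1, x2):
    """An unbiased estimating function: E[X2(X2-1)] = psi^2 E[X1(X1-1)]."""
    return x2 * (x2 - 1) - psi ** 2 * x1 * (x1 - 1)

def info(f):
    """Scalar information <f, s>^2 / <f, f>."""
    cross = expect(lambda a, b: f(a, b) * score(a, b))
    return cross ** 2 / expect(lambda a, b: f(a, b) ** 2)

# <g, s - l_c> should vanish, so l_c is the projection of the score.
orth = expect(lambda a, b: g(a, b) * (score(a, b) - cond_score(a, b)))
info_c, info_g = info(cond_score), info(g)
print(orth, info_c, info_g)
```

In this model the information of the conditional score works out to $\lambda / (\psi (1 + \psi))$, which the numerical value of `info_c` reproduces.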

By combining Theorems 3.3.1 and 3.3.2, we get the following result due to Bhapkar
(1989, 1991a).

Corollary 3.3.2. With the notation as above, $l_c(x; \theta_1)$ is optimal in $L$ if
$U$ is partially ancillary for $\theta_1$ either in the complete sense or in the weak sense, that is

\[ I_g(\theta) \le I_{l_c}(\theta), \qquad \forall \, \theta \in \Theta, \; g \in L. \]

Note that, from the proof of Theorem 3.3.2, the condition of partial ancillarity for
$\theta_1$ in either the complete sense or the weak sense guarantees that the conditional score
is the orthogonal projection of the score function with respect to $\theta_1$ into $L$.

3.3.3    Geometry of Marginal Inference

In this subsection, we study the geometry of marginal inference. As a
consequence of our approach, the optimality results of Bhapkar (1989) and Lloyd
(1987) on marginal inference follow easily. Assume that

(M): the distribution of the statistic $S = S(X)$ depends on $\theta$ only through $\theta_1$.

Definition 3.3.3. A statistic $S = S(X)$ is said to be partially sufficient for $\theta_1$ in
the complete sense if

(i) $S$ satisfies condition (M);

(ii) given $s = S(X)$, the family $\{p^{(U|s)}_\theta : \theta_2 \in \Theta_2\}$ of the conditional distributions of
$U$ for fixed $\theta_1$ is complete for almost all $s$, and for every $\theta_1 \in \Theta_1$.

A statistic $S = S(X)$ is said to be partially sufficient for $\theta_1$ in the weak sense if

(i) $S$ satisfies condition (M);

(ii) the conditional distribution of $U$, given $s = S(X)$, depends on $\theta$ only through
a parametric function $\delta = \delta(\theta)$ ($\delta$ is assumed to be differentiable) such that $(\theta_1, \delta)$
is a one-to-one function of $\theta$.

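Suppose $S = S(X)$ is partially sufficient for $\theta_1$ in the complete (or weak) sense. Let
$p^{(U|s)}_\theta$ denote the conditional pdf of $U$ given $S = s$, and let $l_m(x; \theta_1)$ denote the
marginal score of $S$, that is, the derivative with respect to $\theta_1$ of the log of the
marginal pdf of $S$.

The main result of this subsection is given in the following theorem.

Theorem 3.3.3. If the statistic $S = S(X)$ is partially sufficient for $\theta_1$ in the
complete sense or in the weak sense, then $l_m(x; \theta_1)$ is the orthogonal projection of $s_{\theta_1}$
into $L$.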

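Proof. We are going to prove this result in two cases.
Case 1. $S$ is partially sufficient for $\theta_1$ in the complete sense.
Note that

\[ s_{\theta_1} = l_m(x; \theta_1) + \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1}. \]

We only need to show that for any $g_1 \in L$, $\theta \in \Theta$,

\[ \left\langle g_1, \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1} \right\rangle_\theta = 0. \]

But

\[ \left\langle g_1, \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1} \right\rangle_\theta
 = E\left[ g_1 \left( \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1} \right)^t \Big| \theta \right] \]
\[ = E\left\{ E\left[ g_1 \left( \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1} \right)^t \Big| \theta, S \right] \Big| \theta \right\} \]
\[ = E\left\{ E[g_1 | \theta, S] \left( \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1} \right)^t \Big| \theta \right\}. \]

Since $E\{E[g_1 | \theta, S] | \theta\} = 0$, by the completeness of $\{p^{(U|s)}_\theta : \theta_2 \in \Theta_2\}$ for fixed
$\theta_1 \in \Theta_1$, $E[g_1 | \theta, S] = 0$ for fixed $\theta_1$. This implies that

\[ \left\langle g_1, \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1} \right\rangle_\theta = 0, \qquad \forall \, \theta \in \Theta. \]

Thus $l_m(x; \theta_1)$ is the orthogonal projection of $s_{\theta_1}$ into $L$.
Case 2. $S$ is partially sufficient for $\theta_1$ in the weak sense.
Again note that

\[ s_{\theta_1} = l_m(x; \theta_1) + \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1}. \]

Now since the map $(\theta_1, \theta_2) \to (\theta_1, \delta(\theta))$ is a one-to-one map, the matrix

\[ \begin{pmatrix} I_{d_1} & \partial \delta / \partial \theta_1 \\ 0 & \partial \delta / \partial \theta_2 \end{pmatrix} \]

is invertible. This implies that $\partial \delta / \partial \theta_2$ is invertible. Hence

\[ \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_1} = N(\theta) \, \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_2}, \]

where $N(\theta) = (\partial \delta / \partial \theta_1)(\partial \delta / \partial \theta_2)^{-1}$, with $\partial \delta / \partial \theta_i$ the $d_i \times d_2$ matrix whose
$(a, b)$ entry is $\partial \delta_b / \partial \theta_{ia}$. Thus we only need to show that $\forall \, g_1 \in L$, $\theta \in \Theta$,

\[ \left\langle g_1, \frac{\partial \log p^{(U|s)}_\theta}{\partial \theta_2} \right\rangle_\theta = 0. \]

But this follows easily by differentiating

\[ E[g_1 | \theta] = 0 \]

with respect to $\theta_2$, and using the regularity conditions. Thus $l_m(x; \theta_1)$ is the
orthogonal projection of $s_{\theta_1}$ into $L$. This completes the proof.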

By combining Theorems 3.3.1 and 3.3.3, we get $I_g(\theta) \le I_{l_m}(\theta)$ for all $\theta \in \Theta$, $g \in L$,
which includes the results of Bhapkar (1989, 1991a) and Lloyd (1987) as a corollary.

Corollary 3.3.3.

(1) if $S$ is partially sufficient for $\theta_1$ in the complete sense, then $l_m(x; \theta_1)$ is optimal
in $L$;

(2) if $S$ is partially sufficient for $\theta_1$ in the weak sense, then $l_m(x; \theta_1)$ is optimal in
$L$.

Proof. In both cases, $l_m(x; \theta_1)$ is the orthogonal projection of $s_{\theta_1}$ into $L$. So the
inequality

\[ I_g(\theta) \le I_{l_m}(\theta), \qquad \forall \, \theta \in \Theta, \; g \in L, \]

follows from Theorem 3.3.1.

Note  that  (1)  of  the  above  corollary  is  the  main  result  of  Lloyd  (1987)  and  (2)  is 
due  to  Bhapkar  (1989,  1991a). 

3.4    Locally  Optimal  Estimating  Functions 

In this section, a general result about local optimality of estimating functions
in the presence of nuisance parameters will be proved. As easy consequences of this
result, some results of Lindsay (1982), Godambe (1985), Murphy and Li (1995) will
be studied.

3.4.1    A General Result

In this subsection, we study locally optimal estimating functions in the presence
of nuisance parameters. We first introduce the space of estimating functions of
interest.

Let $\mathcal{X}$ be a sample space, $\Theta = \Theta_1 \times \Theta_2$ the $d_1 + d_2$ dimensional parameter space,
with $\Theta_i \subset R^{d_i}$ $(i = 1, 2)$. A function

\[ g : \mathcal{X} \times \Theta \to R^{d_1} \]

is said to be an unbiased estimating function if

\[ E[g(X, \theta) | \theta] = 0, \qquad \forall \, \theta = (\theta_1, \theta_2) \in \Theta. \]

An estimating function $g$ is said to be regular if it satisfies conditions (I) - (IV) as in
Section 3.3.1.

Let $L$ denote the space of all regular unbiased estimating functions from $\mathcal{X} \times \Theta$
to $R^{d_1}$. Also the score vector $s_{\theta_1}$ is assumed to be regular in the sense described in
(i) and (ii).

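Definition 3.4.1. Let $(L, \langle \cdot,\cdot \rangle_\theta)$ be the family of generalized inner product
spaces, and let $L_0$ be a subspace of $L$. For any $g \in L_0$, the information function of $g$
is defined by

\[ I_g(\theta) = E\left[\frac{\partial g}{\partial \theta_1} \Big| \theta\right]^t \langle g, g \rangle_\theta^{-1} \, E\left[\frac{\partial g}{\partial \theta_1} \Big| \theta\right]. \tag{3.4.6} \]

An element $g^* \in L_0$ is said to be a locally optimal estimating function at $\theta_2 = \theta_{20}$ if

\[ I_{g^*}(\theta_1, \theta_{20}) - I_g(\theta_1, \theta_{20}) \]

is n. n. d., for all $g \in L_0$ and $\theta_1 \in \Theta_1$.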

The following is the main result of this section.

Theorem 3.4.1. Let $L$ be the space of all regular unbiased estimating functions

\[ g : \mathcal{X} \times \Theta \to R^{d_1}. \]

Let $L_0$ be a subspace of $L$. If $g^*$ is the orthogonal projection of $s$ into $L_0$ with respect
to the generalized inner products $\langle \cdot,\cdot \rangle_\theta$ with $\theta_2 = \theta_{20}$, then $g^*$ is locally optimal in
$L_0$, that is, for any fixed $\theta_{20} \in \Theta_2$,

\[ I_{g^*}(\theta_1, \theta_{20}) \ge I_g(\theta_1, \theta_{20}), \]

for all $\theta_1 \in \Theta_1$. Also the locally optimal estimating function in $L_0$ is unique in the
following sense: if $g \in L_0$, then $I_{g^*}(\theta_1, \theta_{20}) = I_g(\theta_1, \theta_{20})$ for all $\theta_1 \in \Theta_1$ if and only
if there exists an invertible matrix valued function $N : \Theta_1 \times \{\theta_{20}\} \to M_{d_1 \times d_1}$ such
that for all $\theta_1 \in \Theta_1$,

\[ g^*(X; \theta_1, \theta_{20}) = N(\theta_1, \theta_{20}) \, g(X; \theta_1, \theta_{20}), \]

with probability 1 with respect to $P_{\theta_1, \theta_{20}}$.

Proof. This follows easily from Theorems 2.2.4 and 3.3.1.

Next  we  apply  Theorem  3.4.1  to  generalize  the  results  in  three  different  cases: 
(1)  Lindsay's  (1982)  result  on  the  local  optimality  of  conditional  score  functions;  (2) 
Godambe's  (1985)  result  on  the  estimation  in  stochastic  processes;  (3)  Murphy  and 
Li's  (1995)  result  on  the  projected  partial  likelihood. 

3.4.2    Local  Optimality  of  Conditional  Score  Functions 

Suppose that $(X_1, \ldots, X_n) = X_{(n)}$ is a sequence of possibly dependent observations
with pdf

\[ f(X_{(n)}; \theta) = f_1(X_{(1)}; \theta) \, f_2(X_{(2)} | X_{(1)}; \theta) \cdots f_n(X_{(n)} | X_{(n-1)}; \theta). \tag{3.4.7} \]

Let $U_j = \frac{\partial}{\partial \theta_1} \log f_j$, and let $S_j(\theta_1) = S_j(X_{(j)}; \theta_1)$ be minimal sufficient for $\theta_2$ with $\theta_1$ fixed
in the pdf $f_j$. Let

\[ W_j = U_j - E[U_j | S_j, X_{(j-1)}], \tag{3.4.8} \]

for $j \in \{1, \ldots, n\}$. The sequence $\{S_j(\theta_1)\}_{j=1}^{n}$ is called sequentially complete if for
each $k$ from 1 to $n$, the system of equalities

\[ E[H(S_k, X_{(k-1)}; \theta_1)] = 0 \tag{3.4.9} \]

for all $\theta$ with $\theta_1$ fixed implies that $H(S_k, X_{(k-1)}; \theta_1)$ is a constant in $S_k$ with
probability one, that is, $H(S_k, X_{(k-1)}; \theta_1)$ does not depend on $S_k$. The following is a slight
generalization of the main result in Lindsay (1982).

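Theorem 3.4.2. Assume sequential completeness and $E[U_i | X_{(i-1)}] = 0$ for all
$i = 1, \ldots, n$. For fixed $\theta_{20} \in \Theta_2$, consider unbiased estimating functions

\[ h : \mathcal{X} \times \Theta_1 \times \{\theta_{20}\} \to R^{d_1}. \tag{3.4.10} \]

Let $L_0$ be the subspace which consists of all unbiased estimating functions from
$\mathcal{X} \times \Theta_1 \times \{\theta_{20}\}$ into $R^{d_1}$. Then

(a) $W(\theta_1, \theta_{20}) = \sum_{j=1}^{n} W_j$ is the orthogonal projection of $s_{\theta_1}$ into $L_0$ with respect
to the generalized inner product $\langle \cdot,\cdot \rangle_{\theta_1, \theta_{20}}$;

(b) $W(\theta_1, \theta_{20})$ is optimal in $L_0$, and the optimal element in $L_0$ is unique in the
following sense: if $g \in L_0$ and $I_g(\theta_1, \theta_{20}) = I_W(\theta_1, \theta_{20})$ for all $\theta_1 \in \Theta_1$, then there
exists an invertible matrix valued function $N : \Theta_1 \times \{\theta_{20}\} \to M_{d_1 \times d_1}$ such that for
all $\theta_1 \in \Theta_1$

\[ W = N(\theta_1, \theta_{20}) \, g, \]

with probability one with respect to $P_{\theta_1, \theta_{20}}$.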

Proof. First note that any $H \in L_0$ admits the decomposition

\[ H = H_n + H_{n-1} + \cdots + H_1 + H_0, \]

where

\[ H_n(X_{(n)}; \theta_1) = H - E[H | S_n, X_{(n-1)}], \]
\[ H_{n-1}(S_n, X_{(n-1)}; \theta_1) = E[H | S_n, X_{(n-1)}] - E[H | S_{n-1}, X_{(n-2)}], \]
\[ H_k(S_{k+1}, X_{(k)}; \theta_1) = E[H | S_{k+1}, X_{(k)}] - E[H | S_k, X_{(k-1)}], \]

for all $k \in \{1, 2, \ldots, n\}$, and $H_0 = E[H | S_1, X_{(0)}]$, so that the sum telescopes to $H$.
By the sequential completeness of $\{S_j(\theta_1)\}_{j=1}^{n}$, $H_k(S_{k+1}, X_{(k)}; \theta_1)$ does not depend on
$S_{k+1}$, that is, $H_k(S_{k+1}, X_{(k)}; \theta_1) = H_k(X_{(k)}; \theta_1)$.

(a) Since

\[ s_{\theta_1} = \sum_{j=1}^{n} U_j = W + \sum_{j=1}^{n} E[U_j | S_j, X_{(j-1)}], \]

it suffices to show that

\[ \langle E[U_j | S_j, X_{(j-1)}], \, H_k(S_{k+1}, X_{(k)}; \theta_1) \rangle_\theta = 0, \]

for all $j, k \in \{1, \ldots, n\}$. But this follows from an easy conditioning argument.

(b) This follows from part (a) and Theorem 3.4.1.

Note  that  part  (a)  of  Theorem  3.4.2  is  a  restatement  of  the  main  result  in  Lindsay 
(1982). 

3.4.3    Locally  Optimal  Estimating  Functions  for  Stochastic  Processes 

In this subsection, we generalize the results of Godambe (1985) to the case where
there are nuisance parameters. As a special case, we get generalized estimating
equations in the presence of nuisance parameters.

Let $\{X_1, X_2, \ldots, X_n\}$ be a discrete stochastic process, and let $\Theta_j \subset R^{d_j}$ $(j = 1, 2)$ be open
sets. Let $h_i$ be an $R^{d_1}$ valued function of $X_1, \ldots, X_i$ and $\theta$, which satisfies, for fixed
$\theta_{20} \in \Theta_2$,

\[ E_{i-1}[h_i(X_1, \ldots, X_i; \theta) | \theta_1, \theta_{20}] = 0, \qquad (i = 1, \ldots, n, \; \theta_1 \in \Theta_1). \tag{3.4.11} \]

In the above, $E_{i-1}$ denotes the conditional expectation conditioning on the first $i - 1$
variables, namely, $X_1, \ldots, X_{i-1}$. Let

\[ L_0 = \Big\{ g : g = \sum_{i=1}^{n} A_{i-1} h_i \Big\}, \]

where $A_{i-1}$ is an $M_{k \times k}$ valued function of $X_1, \ldots, X_{i-1}$ and $\theta_1$, for all $i \in \{1, \ldots, n\}$.

The following theorem generalizes the result of Godambe (1985) on optimal
estimating functions for stochastic processes.

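Theorem 3.4.3. Let $\theta_2 = \theta_{20}$, and suppose $h_i$ satisfies the regularity condition
(R). Let

\[ A_i^* = -E_{i-1}\left[\frac{\partial h_i}{\partial \theta_1} \Big| \theta_1, \theta_{20}\right]^t E_{i-1}[h_i h_i^t | \theta_1, \theta_{20}]^{-1}, \qquad \forall \, i \in \{1, 2, \ldots, n\}, \]

and

\[ g^* = \sum_{i=1}^{n} A_i^* h_i, \]

then the following conclusions hold:

(a) $g^*$ is the orthogonal projection of $s_{\theta_1}$ into $L_0$ with respect to the generalized
inner product $\langle \cdot,\cdot \rangle_{\theta_1, \theta_{20}}$.

(b) $g^*$ is a locally optimal estimating function in $L_0$, i. e.,

\[ I_g(\theta_1, \theta_{20}) \le I_{g^*}(\theta_1, \theta_{20}), \]

for all $g \in L_0$ and $\theta_1 \in \Theta_1$.

(c) If $g \in L_0$ and $E[g g^t | \theta]$ is invertible, then $I_g(\theta_1, \theta_{20}) = I_{g^*}(\theta_1, \theta_{20})$, $\forall \, \theta_1 \in \Theta_1$,
if and only if there exists an invertible matrix function $N : \Theta_1 \times \{\theta_{20}\} \to M_{k \times k}$ such
that for any $\theta_1 \in \Theta_1$,

\[ g^*(X_1, \ldots, X_n; \theta_1, \theta_{20}) = N(\theta_1, \theta_{20}) \, g(X_1, \ldots, X_n; \theta_1, \theta_{20}), \]

with probability 1 with respect to $P_{\theta_1, \theta_{20}}$.

Proof. (a) For any $g = \sum_{i=1}^{n} A_{i-1} h_i \in L_0$ and $\theta_1 \in \Theta_1$,

\[ \langle s_{\theta_1} - g^*, g \rangle_{(\theta_1, \theta_{20})} = \langle s_{\theta_1}, g \rangle_{(\theta_1, \theta_{20})} - \langle g^*, g \rangle_{(\theta_1, \theta_{20})} \]
\[ = \sum_{i=1}^{n} E[s_{\theta_1} h_i^t A_{i-1}^t | (\theta_1, \theta_{20})] - \sum_{i=1}^{n} \sum_{j=1}^{n} E[A_i^* h_i h_j^t A_{j-1}^t | (\theta_1, \theta_{20})]. \tag{3.4.12} \]

But for $i < j$,

\[ E[A_i^* h_i h_j^t A_{j-1}^t | (\theta_1, \theta_{20})] = E\{A_i^* h_i \, E_{j-1}[h_j^t | \theta_1, \theta_{20}] \, A_{j-1}^t | (\theta_1, \theta_{20})\} = 0. \]

Similarly, for $i > j$,

\[ E[A_i^* h_i h_j^t A_{j-1}^t | (\theta_1, \theta_{20})] = 0. \]

Thus from equation (3.4.12) and the regularity condition (R), applied to each $h_i$
conditionally on $X_1, \ldots, X_{i-1}$, we get

\[ \langle s_{\theta_1} - g^*, g \rangle_{\theta_1, \theta_{20}} = -\sum_{i=1}^{n} E\left\{ E_{i-1}\left[\frac{\partial h_i}{\partial \theta_1} \Big| \theta_1, \theta_{20}\right]^t A_{i-1}^t \Big| \theta_1, \theta_{20} \right\} \]
\[ - \sum_{i=1}^{n} E\{ A_i^* \, E_{i-1}[h_i h_i^t | \theta_1, \theta_{20}] \, A_{i-1}^t | \theta_1, \theta_{20} \} = 0, \]

since $A_i^* E_{i-1}[h_i h_i^t | \theta_1, \theta_{20}] = -E_{i-1}[\partial h_i / \partial \theta_1 | \theta_1, \theta_{20}]^t$ by the definition of $A_i^*$.

Hence $g^*$ is the orthogonal projection of $s_{\theta_1}$ into $L_0$.

Parts (b) and (c) of the theorem follow easily from part (a) and Theorem 3.4.1.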
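A first-order autoregression is the simplest setting in which the weights of Theorem 3.4.3 can be written down explicitly. The sketch below is an illustration only, under assumptions not made in the text: $X_i = \theta X_{i-1} + \varepsilon_i$ with independent errors of variance $\sigma^2$, the nuisance parameter $\sigma^2$ held fixed at `sigma2_0`, and $h_i = X_i - \theta X_{i-1}$. Then $E_{i-1}[\partial h_i / \partial \theta] = -X_{i-1}$ and $E_{i-1}[h_i^2] = \sigma^2$, so the weight is proportional to $X_{i-1}$; its sign does not affect the root of the estimating equation.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, sigma2_0, n = 0.6, 1.0, 2000   # illustrative values

# Simulate the AR(1) path X_i = theta * X_{i-1} + eps_i, starting from X_0 = 0.
x = np.zeros(n + 1)
for i in range(1, n + 1):
    x[i] = theta_true * x[i - 1] + rng.normal(scale=np.sqrt(sigma2_0))

prev, curr = x[:-1], x[1:]

def g_star(theta):
    """Weighted martingale estimating function sum_i A_i* h_i."""
    h = curr - theta * prev          # h_i, conditional mean zero
    dh = -prev                       # E_{i-1}[dh_i / dtheta]
    var_h = sigma2_0                 # E_{i-1}[h_i^2], nuisance parameter held fixed
    weights = -dh / var_h            # A_i*; the sign does not change the root
    return np.sum(weights * h)

# g_star is linear in theta, so its root is available in closed form.
theta_hat = np.sum(prev * curr) / np.sum(prev ** 2)
print(theta_hat, g_star(theta_hat))
```

The root coincides with the conditional least squares estimator, which is the form of the optimal estimating equation usually associated with Godambe (1985) for this model.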

As a corollary to Theorem 3.4.3, the generalized estimating equations in the
presence of nuisance parameters for multivariate data can be easily obtained.

Corollary 3.4.1. Suppose that $X_1, \ldots, X_n$ are independent and that, for fixed $\theta_{20} \in \Theta_2$
and for each $i \in \{1, 2, \ldots, n\}$,

\[ h_i : \mathcal{X}_i \times \Theta \to R^{d_1}, \]

with $E[h_i(X_i, \theta) | \theta_1, \theta_{20}] = 0$. Consider the subspace $L_0$ as in Theorem 3.4.3.
Then the generalized estimating equations determined by $\{h_i\}$ are given by

\[ \sum_{i=1}^{n} A_i^* h_i = 0, \]

where

\[ A_i^* = -E\left[\frac{\partial h_i}{\partial \theta_1} \Big| \theta_1, \theta_{20}\right]^t E[h_i h_i^t | \theta_1, \theta_{20}]^{-1}, \qquad \forall \, i \in \{1, 2, \ldots, n\}. \]

The above corollary provides a very convenient way to construct generalized
estimating equations. For instance, if $h_i$ is chosen as a linear (or quadratic) function of
$X_i$, then the corresponding generalized estimating equations reduce to the GEE1 and
GEE2 studied by Liang, Zeger and their associates. See Liang and
Zeger (1986), Liang, Zeger and Qaqish (1992), Diggle, Liang and Zeger (1994).
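For a linear choice of $h_i$ the estimating equation of Corollary 3.4.1 can be solved in closed form. The following sketch makes several assumptions that are not in the text: clustered responses $y_i$ with mean $X_i \beta$ (identity link), $h_i = y_i - X_i \beta$, and a working covariance $V$ with exchangeable correlation `rho` treated as the fixed nuisance value. Under these assumptions $E[\partial h_i / \partial \beta] = -X_i$, and the equation becomes $\sum_i X_i^t V^{-1} (y_i - X_i \beta) = 0$, whose root is the generalized least squares estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n_clusters, size, rho = 50, 4, 0.3          # illustrative design
beta_true = np.array([1.0, -0.5])

# Exchangeable working covariance, with the nuisance parameter rho held fixed.
V = (1 - rho) * np.eye(size) + rho * np.ones((size, size))
V_inv = np.linalg.inv(V)
chol = np.linalg.cholesky(V)

X = rng.normal(size=(n_clusters, size, 2))  # covariates, one block per cluster
y = X @ beta_true + rng.normal(size=(n_clusters, size)) @ chol.T

def estimating_function(beta):
    """sum_i X_i^t V^{-1} (y_i - X_i beta), i.e. sum_i A_i* h_i up to sign."""
    resid = y - X @ beta
    return np.einsum('ijk,jl,il->k', X, V_inv, resid)

# The equation is linear in beta, so it can be solved directly.
lhs = np.einsum('ijk,jl,ilm->km', X, V_inv, X)
rhs = np.einsum('ijk,jl,il->k', X, V_inv, y)
beta_hat = np.linalg.solve(lhs, rhs)
print(beta_hat, estimating_function(beta_hat))
```

With a nonlinear mean function the same equation would generally be solved iteratively; the linear case is used here only because the root can be verified directly.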

3.4.4    Local  Optimality  of  Projected  Partial  Likelihood 

In this subsection, we generalize the result of Murphy and Li (1995) on projected
partial likelihood to the nuisance parameter case. Also the application of this result
to longitudinal data will be pointed out.

Suppose that the data consist of a vector of observations $X$ with density $f(x; \theta_1, \theta_2)$,
where $\theta_1$ is the vector of parameters of interest, which is finite dimensional, and $\theta_2$ is the
vector of nuisance parameters, which may be infinite dimensional. Suppose there is
a one-to-one transformation of the data $X$ into a set of variables $Y_1, C_1, \ldots, Y_m, C_m$.
Let

\[ Y^{(j)} = (Y_1, \ldots, Y_j), \qquad C^{(j)} = (C_1, \ldots, C_j), \qquad j = 1, \ldots, m. \tag{3.4.13} \]

For instance, in survival analysis, $Y_1, \ldots, Y_m$ denote the lifetime variables, and $C_1, \ldots, C_m$
the censoring variables.

Note that the joint density of $Y^{(m)}, C^{(m)}$ can be written as

\[ \prod_{j=1}^{m} f(c_j | c^{(j-1)}, y^{(j-1)}; \theta_1, \theta_2) \prod_{j=1}^{m} f(y_j | c^{(j)}, y^{(j-1)}; \theta_1, \theta_2), \tag{3.4.14} \]

where $c^{(0)}$ and $y^{(0)}$ are arbitrary constants, and are used only for notational purposes.
Then $P(\theta_1) = \prod_{j=1}^{m} f(y_j | c^{(j)}, y^{(j-1)}; \theta_1, \theta_2)$ is called the Cox partial likelihood.


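By (3.4.14), the score function with respect to $\theta_1$ is

\[ s = \sum_{j=1}^{m} \frac{\partial \log f(c_j | c^{(j-1)}, y^{(j-1)}; \theta_1, \theta_2)}{\partial \theta_1}
 + \sum_{j=1}^{m} \frac{\partial \log f(y_j | c^{(j)}, y^{(j-1)}; \theta_1, \theta_2)}{\partial \theta_1}. \tag{3.4.15} \]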

Next we introduce the subspace of unbiased estimating functions which is of
interest in this subsection. This is similar to the space considered by Godambe (1985)
in studying the foundation of finite sample estimation in stochastic processes.

For any $j \in \{1, 2, \ldots, m\}$, consider estimating functions

\[ h_j : Y^{(j)} \times C^{(j)} \times \Theta_1 \to R^{d_1}, \tag{3.4.16} \]

\[ E[h_j | y^{(j-1)}, c^{(j)}; \theta] = 0, \tag{3.4.17} \]

for all $\theta \in \Theta_1 \times \Theta_2$.

For chosen $\{h_j\}_{j=1}^{m}$, consider the space

\[ L_0 = \Big\{ g : g = \sum_{j=1}^{m} A_j(\theta) h_j \Big\}, \tag{3.4.18} \]

where for all $j \in \{1, \ldots, m\}$, $A_j(\theta)$ is a $d_1 \times d_1$ matrix, and $h_j$ satisfies (3.4.16) and
(3.4.17).
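Let

\[ s^* = \frac{\partial \log P}{\partial \theta_1} = \sum_{j=1}^{m} \frac{\partial \log f(y_j | c^{(j)}, y^{(j-1)}; \theta_1, \theta_2)}{\partial \theta_1}. \]

The main result of this subsection is the following.

Theorem 3.4.4. For fixed $\theta_2 = \theta_{20}$, for any $i \in \{1, \ldots, m\}$, let

and

\[ g^* = \sum_{j=1}^{m} A_j^* h_j. \]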

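Then

(a) $g^*$ is the orthogonal projection of $s^*$ into $L_0$, that is, for any $g \in L_0$,

\[ \langle s^* - g^*, g \rangle_{\theta_1, \theta_{20}} = 0, \]

for all $\theta_1 \in \Theta_1$;

(b) $g^*$ is locally optimal in $L_0$ at $\theta_2 = \theta_{20}$;

(c) If $g \in L_0$ and $E[g g^t | \theta]$ is invertible, then $I_g(\theta_1, \theta_{20}) = I_{g^*}(\theta_1, \theta_{20})$, $\forall \, \theta_1 \in \Theta_1$,
if and only if there exists an invertible matrix function $N : \Theta_1 \times \{\theta_{20}\} \to M_{k \times k}$ such
that for any $\theta_1 \in \Theta_1$,

\[ g^*(X; \theta_1, \theta_{20}) = N(\theta_1, \theta_{20}) \, g(X; \theta_1, \theta_{20}), \]

with probability 1 with respect to $P_{\theta_1, \theta_{20}}$.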
Proof. For any $j \in \{1, \ldots, m\}$, let

\[ s_j = \frac{\partial \log f(c_j | c^{(j-1)}, y^{(j-1)}; \theta_1, \theta_2)}{\partial \theta_1}, \]

then $s - s^* = \sum_{j=1}^{m} s_j$.

(a) For any $g \in L_0$, let $g_j = A_j(\theta) h_j$, $j = 1, \ldots, m$, be the components in the
definition of $L_0$. In order to show that $E[(s - s^*) g^t | \theta] = 0$, it suffices to prove that

\[ E[s_j g_{j'}^t | \theta] = 0, \]

for all $j, j' \in \{1, \ldots, m\}$. Consider the following three cases:
Case 1. $j > j'$. Then

\[ E[s_j g_{j'}^t | \theta] = E\{ E[s_j | c^{(j-1)}, y^{(j-1)}] \, g_{j'}^t | \theta \} = 0, \]

since $E[s_j | c^{(j-1)}, y^{(j-1)}] = 0$.
Case 2. $j = j'$. Then

\[ E[s_j g_j^t | \theta] = E\{ s_j \, E[g_j | y^{(j-1)}, c^{(j)}]^t | \theta \} = 0, \]

since $E[g_j | y^{(j-1)}, c^{(j)}] = 0$.
Case 3. $j' > j$. Then

\[ E[s_j g_{j'}^t | \theta] = E\{ s_j \, E[g_{j'} | y^{(j'-1)}, c^{(j'-1)}]^t | \theta \} = 0, \]

since $E[g_{j'} | y^{(j'-1)}, c^{(j'-1)}] = 0$.

Parts (b) and (c) follow from Theorem 3.4.1 and part (a).

Note that Murphy and Li (1995) studied projected partial likelihood in the case
$d_1 = 1$, in the absence of nuisance parameters, and when all the $C_j$'s are empty.
As in Murphy and Li's comments, because of the nested structure of $\{Y^{(j)}, C^{(j)}\}$,
removing drop-out factors in this way will not cause bias in the resulting partial score
function, as long as the subjects' drop-out depends only on the past. This is in contrast
to a generalized estimating equation, which is biased under random drop-out.

3.5    Optimal  Conditional  Estimating  Functions 

In this section, we study optimal conditional estimating functions. Let $\mathcal{X}$ be a
sample space, $\Theta = \Theta_1 \times \Theta_2$, $\Theta_i \subset R^{d_i}$ $(i = 1, 2)$, and let $\theta_1$ be the parameter of interest.
For fixed $\theta_1$, assume that $S(\theta_1)$ is sufficient for $\theta_2$. A function

\[ g : \mathcal{X} \times \Theta_1 \to R^{d_1} \]

is called a regular conditional unbiased estimating function if

(1) $E[g | S(\theta_1)] = 0$, for all $\theta_1 \in \Theta_1$;

(2) $E[g g^t | S(\theta_1)]$ is positive definite.

Consider the space $L$ of all regular conditional unbiased estimating functions. A
family of generalized inner products on $L$ is defined as follows: for any $g_1, g_2 \in L$,

\[ \langle g_1, g_2 \rangle_{S(\theta_1)} = E[g_1 g_2^t | S(\theta_1)]. \tag{3.5.20} \]

For any $g \in L$, the conditional information of $g$ is defined as follows:

\[ I_g(\theta_1 | S(\theta_1)) = E\left[\frac{\partial g}{\partial \theta_1} \Big| S(\theta_1)\right]^t \langle g, g \rangle_{S(\theta_1)}^{-1} \, E\left[\frac{\partial g}{\partial \theta_1} \Big| S(\theta_1)\right]. \tag{3.5.21} \]

Definition 3.5.1. Let $L_0$ be a subspace of $L$. A function $g^* \in L_0$ is called an
optimal conditional estimating function in $L_0$ if

\[ I_{g^*}(\theta_1 | S(\theta_1)) \ge I_g(\theta_1 | S(\theta_1)), \]

for all $g \in L_0$.

The main result in this section is given in the following theorem.

Theorem 3.5.1. Define

\[ s = \frac{\partial}{\partial \theta_1} \log f(X | S(\theta_1)). \tag{3.5.22} \]

Suppose $L_0$ is a subspace of $L$, and assume that the orthogonal projection $g^*$ of $s$ into
$L_0$ exists. Then

(a) $g^*$ is an optimal conditional estimating function in $L_0$;

(b) the optimal element in $L_0$ is unique in the sense that if $g \in L_0$, then $I_{g^*}(\theta_1 | S(\theta_1)) =
I_g(\theta_1 | S(\theta_1))$ for all $\theta_1 \in \Theta_1$ if and only if there exists an invertible matrix valued
function $N : \mathcal{X} \times \Theta_1 \to M_{k \times k}$ of the form $N(X, \theta_1) = N(S(\theta_1))$ such that

\[ g^*(X; \theta_1) = N(S(\theta_1)) \, g(X; \theta_1). \]

Proof. (a) Since $E[g | S(\theta_1)] = 0$, under the regularity condition given in (2.5),

\[ E\left[\frac{\partial g}{\partial \theta_1} \Big| S(\theta_1)\right] = -E[g \, s^t | S(\theta_1)]. \]

Thus, from the definition of $I_g(\theta_1 | S(\theta_1))$,

\[ I_g(\theta_1 | S(\theta_1)) = \langle g, s \rangle_{S(\theta_1)}^t \, \langle g, g \rangle_{S(\theta_1)}^{-1} \, \langle g, s \rangle_{S(\theta_1)} . \]

Since $g^*$ is the orthogonal projection of $s$ into $L_0$ with respect to $\langle \cdot,\cdot \rangle_{S(\theta_1)}$,
the optimality of $g^*$ in $L_0$ follows from Theorem 2.2.3.
(b) This follows from Theorem 2.2.4.

Next, as applications of Theorem 3.5.1, we generalize the results of Godambe and
Thompson (1989) on optimal estimating functions into the conditional estimating
functions framework.

To this end, let $\mathcal{X}$ denote the sample space, $\theta_1 = (\theta_{11}, \ldots, \theta_{1 d_1})$ be a vector of
parameters, and $h_j$, $j = 1, \ldots, k$, be real functions on $\mathcal{X} \times \Theta_1$ such that

\[ E[h_j(X, \theta) | S(\theta_1), \mathcal{X}_j] = 0, \qquad \forall \, \theta \in \Theta, \quad j = 1, \ldots, k, \]

where $\mathcal{X}_j$ is a specified partition of $\mathcal{X}$, $j = 1, \ldots, k$. We will denote

\[ E[\,\cdot\, | S(\theta_1), \mathcal{X}_j] = E_{(j)}[\,\cdot\, | S(\theta_1)]. \]

Consider the class of estimating functions

\[ L_0 = \{ g : g = (g_1, \ldots, g_{d_1})^t \}, \]

where

\[ g_r = \sum_{j=1}^{k} q_{jr} h_j, \qquad r = 1, \ldots, d_1, \]

$q_{jr} : \mathcal{X} \times \Theta_1 \to R$ being measurable with respect to the partition $\mathcal{X}_j$ for $j =
1, \ldots, k$, $r = 1, \ldots, d_1$.
Let

\[ q_{jr}^* = -\frac{E_{(j)}[\partial h_j / \partial \theta_{1r} | S(\theta_1)]}{E_{(j)}[h_j^2 | S(\theta_1)]}, \]

for all $j = 1, \ldots, k$, $r = 1, \ldots, d_1$, and

\[ g_r^* = \sum_{j=1}^{k} q_{jr}^* h_j, \qquad r = 1, \ldots, d_1. \]

The estimating functions $h_j$, $j = 1, \ldots, k$, are said to be mutually orthogonal if

\[ E_{(j)}[q_{jr}^* h_j \, q_{j'r'}^* h_{j'} | S(\theta_1)] = 0, \qquad \forall \, j \ne j', \quad r, r' = 1, \ldots, d_1. \tag{3.5.24} \]

Theorem 3.5.2. Suppose $\{h_j\}_{j=1}^{k}$ are mutually orthogonal. Then the following
hold:

(a) $g^*$ is the orthogonal projection of the score function $s$ into $L_0$.

(b) $g^*$ is an optimal estimating function in $L_0$.

(c) If $g \in L_0$, and $E[g g^t | \theta]$ is invertible, then $I_g(\theta) = I_{g^*}(\theta)$, $\forall \, \theta \in \Theta$, if and
only if there exists an invertible matrix function $N : \mathcal{X} \times \Theta_1 \to M_{k \times k}$ of the form
$N(X, \theta_1) = N(S(\theta_1))$ such that for any $\theta_1 \in \Theta_1$,

\[ g^*(X; \theta_1) = N(S(\theta_1)) \, g(X; \theta_1), \]

with probability 1 with respect to $P_\theta$.

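Proof. (a) We only need to show that, for every $r \in \{1, \ldots, d_1\}$ and $g_r = \sum_{j=1}^{k} q_{jr} h_j$,

$$\langle s - g^*, g_r \rangle_{S(\theta_1)} = 0, \quad \forall\, \theta_1 \in \Theta_1,$$

that is,

$$\langle s, g_r \rangle_{S(\theta_1)} = \langle g^*, g_r \rangle_{S(\theta_1)}, \quad \forall\, \theta_1 \in \Theta_1.$$

Working componentwise, with $s_r$ the $r$-th component of $s$, mutual orthogonality gives

$$\langle g^*_r, g_r \rangle_{S(\theta_1)} = \sum_{j=1}^{k} \sum_{j'=1}^{k} E[q^*_{jr} h_j \, q_{j'r} h_{j'} \mid S(\theta_1)]$$

$$= \sum_{j=1}^{k} E[q^*_{jr} q_{jr} h_j^2 \mid S(\theta_1)]$$

$$= \sum_{j=1}^{k} E\{ q^*_{jr} q_{jr} E_{(j)}[h_j^2 \mid S(\theta_1)] \mid S(\theta_1) \}$$

$$= -\sum_{j=1}^{k} E\{ q_{jr} E_{(j)}[\partial h_j / \partial \theta_{1r} \mid S(\theta_1)] \mid S(\theta_1) \}.$$

Also

$$\langle s_r, g_r \rangle_{S(\theta_1)} = \sum_{j=1}^{k} E[q_{jr} s_r h_j \mid S(\theta_1)]$$

$$= \sum_{j=1}^{k} E\{ q_{jr} E_{(j)}[s_r h_j \mid S(\theta_1)] \mid S(\theta_1) \}$$

$$= -\sum_{j=1}^{k} E\{ q_{jr} E_{(j)}[\partial h_j / \partial \theta_{1r} \mid S(\theta_1)] \mid S(\theta_1) \}.$$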


Thus  g*  is  the  orthogonal  projection  of  the  score  function  into  L0. 
Once  again  (b)  and  (c)  follow  from  part  (a)  and  Theorem  2.2.4. 


CHAPTER  4 

CONVEXITY  AND  ITS  APPLICATIONS  TO  STATISTICS 


4.1  Introduction 


In this chapter, we first prove some general results about convexity, and then
apply the results to various statistical problems, which include the theory of optimum
experimental designs, the fundamental theorem of mixture distributions due to
Lindsay (1983a), and the asymptotic minimaxity of robust estimation due to Huber.
Huber (1964) proved an asymptotic minimaxity result for estimating functions for
the location parameter. In this chapter, this fundamental result will be generalized
to general estimating functions. The geometric optimality of estimating functions
proved in Chapter 2 will be used to prove a necessary and sufficient condition for
the asymptotic minimaxity of estimating functions in multi-dimensional parameter
spaces.

The contents of this chapter are organized as follows. In Section 4.2, a few simple
results about matrix-valued convex functions will be proved. We also include some of
the well known results in convex analysis, such as the Krein-Milman theorem about
extreme points of convex sets, and the Caratheodory theorem about the representation
of elements of a convex set in a finite dimensional vector space. In Section 4.3,
the results of Section 4.2 are applied to the theory of optimum experimental designs.
The fundamental result on optimal design theory is generalized to the matrix-valued
case. In Section 4.4, the results of Section 4.2 are applied to the mixture distribution
situation; the fundamental result about mixture distributions due to Lindsay (1983a)
is an easy consequence. In Section 4.5, the results of Section 4.2 and Chapter 2 will
be used to generalize the classical asymptotic minimaxity result of Huber (1964) in
the estimating function framework.

4.2    Some  Simple  Results  About  Convexity 

Let $L$ be a linear space. A subset $C$ of $L$ is said to be convex if for every $x, y \in C$
and $\lambda \in [0, 1]$,

$$\lambda x + (1 - \lambda) y \in C.$$

A function $f : C \to R$ is said to be convex if for any $x, y \in C$ and $\lambda \in [0, 1]$,

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$$

A symmetric matrix-valued function $N : C \to M_{k \times k}$ (i.e., for any $x \in C$, $N(x)$ is
a symmetric $k \times k$ matrix) is said to be convex if for any $x, y \in C$ and $\lambda \in [0, 1]$,

$$N(\lambda x + (1 - \lambda) y) \le \lambda N(x) + (1 - \lambda) N(y),$$

where for two $k \times k$ matrices $A, B$, $A \le B$ means that $B - A$ is nonnegative definite
(n.n.d.). In the following, we only study properties of matrix-valued convex functions,
since for $k = 1$ they reduce to the real-valued case.

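For every $x, y \in C$, consider the function on $[0, 1]$ defined by

$$N(\lambda; x, y) = N((1 - \lambda) x + \lambda y);$$

then $N(\lambda; x, y)$ is a convex function of $\lambda$. The directional derivative of $N$ at $x$ in the
direction of $y$ is defined as

$$F_N(x; y) = \lim_{\lambda \to 0+} \frac{N(\lambda; x, y) - N(0; x, y)}{\lambda}. \tag{4.2.1}$$

The existence of the limit is justified as follows. Since

$$\lambda_1 = \left( 1 - \frac{\lambda_1}{\lambda_2} \right) 0 + \frac{\lambda_1}{\lambda_2} \lambda_2$$

for $0 < \lambda_1 < \lambda_2 \le 1$,

$$N(\lambda_1; x, y) \le \left( 1 - \frac{\lambda_1}{\lambda_2} \right) N(0; x, y) + \frac{\lambda_1}{\lambda_2} N(\lambda_2; x, y).$$

This implies

$$\frac{N(\lambda_1; x, y) - N(0; x, y)}{\lambda_1} \le \frac{N(\lambda_2; x, y) - N(0; x, y)}{\lambda_2}, \tag{4.2.2}$$

that is, $[N(\lambda; x, y) - N(0; x, y)] / \lambda$ is a nondecreasing function of $\lambda$ on $(0, 1]$, so it
decreases monotonically as $\lambda \to 0+$. Hence the limit in (4.2.1) is well defined.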

From (4.2.1) and (4.2.2), for a convex function $N$,

$$F_N(x; y) \le N(1; x, y) - N(0; x, y) = N(y) - N(x), \quad \text{for all } x, y \in C. \tag{4.2.3}$$

The following result will be used repeatedly in the sequel.

Theorem 4.2.1. Suppose that $N$ is convex. Then $x_0 \in C$ satisfies $N(x_0) \le N(y)$
for all $y \in C$ if and only if

$$F_N(x_0; y) \ge 0 \tag{4.2.4}$$

for all $y \in C$.

Proof. Suppose that $N(x_0) \le N(y)$ for all $y \in C$. Since $(1 - \lambda) x_0 + \lambda y \in C$ for every
$\lambda \in (0, 1]$, the quotient $[N(\lambda; x_0, y) - N(0; x_0, y)] / \lambda$ is nonnegative definite. Hence

$$F_N(x_0; y) = \lim_{\lambda \to 0+} \frac{N(\lambda; x_0, y) - N(0; x_0, y)}{\lambda}$$

is n.n.d. for all $y \in C$.

Conversely, if $F_N(x_0; y) \ge 0$ for all $y \in C$, then from (4.2.3),

$$N(y) - N(x_0) \ge F_N(x_0; y) \ge 0.$$

Thus $N(x_0) \le N(y)$ for all $y \in C$.
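The monotonicity in (4.2.2), the bound (4.2.3) and the criterion of Theorem 4.2.1 can be checked numerically on a simple example. The sketch below assumes the illustrative function $N(x) = (x - c)(x - c)^t$ on $R^3$, which is not taken from the text; it is convex in the matrix sense and is minimized at $x_0 = c$. The helper names `quotient` and `is_nnd` are hypothetical.

```python
import numpy as np

def N(x, c):
    """Illustrative matrix-valued convex function N(x) = (x - c)(x - c)^t (an assumption)."""
    d = x - c
    return np.outer(d, d)

def quotient(lam, x, y, c):
    """Difference quotient [N((1 - lam) x + lam y) - N(x)] / lam appearing in (4.2.1)."""
    return (N((1 - lam) * x + lam * y, c) - N(x, c)) / lam

def is_nnd(A, tol=1e-10):
    """True if the symmetrized matrix A is nonnegative definite up to tol."""
    return np.linalg.eigvalsh((A + A.T) / 2).min() >= -tol

rng = np.random.default_rng(0)
c, x, y = rng.normal(size=(3, 3))
lams = [1.0, 0.5, 0.1, 0.01, 0.001]
quots = [quotient(lam, x, y, c) for lam in lams]

# (4.2.2): the quotient at a larger lambda dominates the quotient at a smaller one.
monotone = all(is_nnd(quots[i] - quots[i + 1]) for i in range(len(lams) - 1))

# Closed-form directional derivative for this particular N.
F_xy = np.outer(y - x, x - c) + np.outer(x - c, y - x)

# (4.2.3): N(y) - N(x) - F_N(x; y) is nonnegative definite.
bound_holds = is_nnd(N(y, c) - N(x, c) - F_xy)

# Theorem 4.2.1 at the minimizer x0 = c: the directional derivative is n.n.d.
F_cy = quotient(1e-6, c, y, c)
print(monotone, bound_holds, is_nnd(F_cy))
```

For this choice of $N$ the quotient equals $F_N(x; y) + \lambda (y - x)(y - x)^t$, which makes the monotone behaviour in $\lambda$ visible directly.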

Next let $L$ be a locally convex vector space, and let $N$ be a symmetric matrix-valued
function; then $N$ is said to be Gateaux differentiable at $x$ if there exists a
continuous linear operator $A : L \to M_{k \times k}$ such that

$$F_N(x; y) = A(y - x), \quad \text{for all } y \in C. \tag{4.2.5}$$

Before  stating  the  next  result,  let  us  recall  one  of  the  well  known  results  from 
functional  analysis. 

Theorem 4.2.2 (Krein-Milman). Let $L$ be a locally convex vector space, and
let $C$ be a convex compact subset of $L$. Then

$$C = \overline{\mathrm{conv}}(\mathrm{ext}(C)),$$

where $\mathrm{ext}(C)$ denotes the set of extreme points of $C$, and $\overline{\mathrm{conv}}(A)$ denotes the closed
convex hull of $A$, that is, the smallest closed convex set containing $A$.

Now, equipped with Gateaux differentiability and the Krein-Milman theorem, we are
in a position to prove the following result.

Theorem 4.2.3. Let $L$ be a locally convex vector space, and let $C$ be a convex
compact subset of $L$. If $N$ is convex and Gateaux differentiable at $x_0$, then $x_0 \in C$ satisfies
$N(x_0) \le N(y)$ for all $y \in C$ if and only if

$$F_N(x_0; y) \ge 0 \tag{4.2.6}$$

for all $y \in \mathrm{ext}(C)$.

Proof. From the definition of Gateaux differentiability, $F_N(x_0; y) = A(y - x_0)$ for all
$y \in C$, for some continuous linear operator $A$. The map $y \mapsto A(y - x_0)$ is continuous and
affine, and by the Krein-Milman theorem $C$ is the closed convex hull of $\mathrm{ext}(C)$; hence
$F_N(x_0; y) \ge 0$ for all $y \in C$ is equivalent to $F_N(x_0; y) \ge 0$ for all $y \in \mathrm{ext}(C)$.
Thus Theorem 4.2.3 follows from Theorem 4.2.1.

Next the famous theorem of Caratheodory about the representation of elements
of a convex set in a finite dimensional vector space is presented. The present proof,
taken directly from Silvey (1980), is included for the sake of completeness.

Theorem 4.2.4 (Caratheodory). Let $S$ be a subset of $R^n$. Then every element
$c$ in $\mathrm{conv}(S)$ can be expressed as a convex combination of at most $n + 1$ elements of
$S$. If $c$ is in the boundary of $\mathrm{conv}(S)$, $n + 1$ can be replaced by $n$.

Proof. Let

$$S' = \{ (1, x) : x \in S \}$$

be a subset of $R^{n+1}$, and let $K$ be the convex cone generated by $S'$. Let $y \in K$; then $y$
can be written as

$$y = \lambda_1 y_1 + \ldots + \lambda_m y_m,$$

where each $\lambda_i > 0$ and each $y_i \in S'$. Suppose that the $y_i$ are not linearly independent.
Then there exist $\mu_1, \ldots, \mu_m$, not all zero, such that

$$\mu_1 y_1 + \ldots + \mu_m y_m = 0.$$

Since the first component of each $y_i$ is 1, $\mu_1 + \ldots + \mu_m = 0$. Hence at least one $\mu_i$
is positive. Let $\lambda$ be the largest number such that $\lambda \mu_i \le \lambda_i$, $i = 1, \ldots, m$; $\lambda$ is finite
since at least one $\mu_i$ is positive. Now let $\lambda'_i = \lambda_i - \lambda \mu_i$; then

$$y = \lambda'_1 y_1 + \ldots + \lambda'_m y_m,$$

and at least one $\lambda'_i = 0$. Thus $y$ can be expressed as a positive linear combination of
fewer than $m$ elements of $S'$. This argument can continue until $y$ has been expressed
as a positive linear combination of at most $n + 1$ elements of $S'$, since more than
$n + 1$ elements are linearly dependent. Now the first part of the theorem follows by
applying the above result to $(1, c) \in K$; the first component shows that the resulting
coefficients sum to 1.
Next suppose that $y \in K$ and

$$y = \lambda_1 y_1 + \ldots + \lambda_{n+1} y_{n+1},$$

where each $\lambda_i > 0$ and the $y_i$ are linearly independent. Then $y$ is an interior point of
$K$. Thus any boundary point of $K$ can be expressed as a positive linear combination
of at most $n$ linearly independent elements of $S'$. So the second part of the theorem
follows.
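The reduction step in the proof above is constructive and can be sketched in code. The following is a minimal illustration, assuming a finite set of points supplied as a NumPy array; the function name `caratheodory` and the tolerance `tol` are hypothetical choices, and the null vector $\mu$ is obtained from a singular value decomposition rather than by any method stated in the text.

```python
import numpy as np

def caratheodory(points, weights, tol=1e-12):
    """Reduce a convex combination of points in R^n to at most n + 1 points.

    points: (m, n) array; weights: (m,) positive weights summing to 1.
    Returns (points, weights) representing the same element of the convex hull.
    """
    P = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = P.shape[1]
    while len(w) > n + 1:
        # Lifted points y_i = (1, x_i); more than n + 1 of them are linearly dependent.
        Y = np.hstack([np.ones((len(w), 1)), P])
        # The last right singular vector of Y^t lies in its null space:
        # sum_i mu_i y_i = 0, and in particular sum_i mu_i = 0.
        mu = np.linalg.svd(Y.T)[2][-1]
        if mu.max() <= 0:
            mu = -mu
        # Largest step lam such that lam * mu_i <= w_i for all i.
        pos = mu > tol
        lam = np.min(w[pos] / mu[pos])
        w = w - lam * mu
        # At least one weight has been driven to zero; drop it.
        keep = w > tol
        P, w = P[keep], w[keep]
    return P, w / w.sum()

rng = np.random.default_rng(1)
pts = rng.normal(size=(10, 2))
wts = rng.dirichlet(np.ones(10))
target = wts @ pts
red_pts, red_wts = caratheodory(pts, wts)
print(len(red_wts), np.allclose(red_wts @ red_pts, target))
```

Each pass removes at least one point while leaving the represented element unchanged, which mirrors the induction in the proof.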

Proposition  4.2.1.  If  C  is  a  compact  subset  of  a  locally  convex  vector  space, 
then  conv(C)  is  compact. 

Theorem 4.2.5. (a) With the same notation as in Theorem 4.2.3, suppose that $N$ is
Gateaux differentiable on $C$. Then the following are equivalent:

(i) $x_0$ minimizes $N(x)$;

(ii) $x_0$ maximizes $\inf_{y \in C} a^t F_N(x; y) a$, for any $k \times 1$ real vector $a$;

(iii) $\inf_{y \in C} a^t F_N(x_0; y) a = 0$, for any $k \times 1$ real vector $a$.

(b) If $x_0$ minimizes $N(x)$, then $(x_0, x_0)$ is a saddle point of $F_N$, that is,

$$F_N(x_0; y_1) \ge 0 = F_N(x_0; x_0) \ge F_N(y_2; x_0),$$

for all $y_1, y_2 \in C$.

(c) If $x_0$ minimizes $N(x)$, then the support of $x_0$ is contained in $\{ y : F_N(x_0; y) = 0 \}$.
More precisely,

$$\{ y_i \in C : x_0 = \sum_i \lambda_i y_i, \ \lambda_i > 0, \ \sum_i \lambda_i = 1 \} \subset \{ y : F_N(x_0; y) = 0 \}.$$

Proof. (a) First note that from the Gateaux differentiability of $N$, for any real $k \times 1$
vector $a$ and $x \in C$,

$$\inf_{y \in \mathrm{ext}(C)} a^t F_N(x; y) a = \inf_{y \in C} a^t F_N(x; y) a,$$

and

$$\inf_{y \in C} a^t F_N(x; y) a \le a^t F_N(x; x) a = 0.$$

((i) $\Leftrightarrow$ (iii)). Note that $x_0$ minimizes $N(x)$ if and only if for any real $k \times 1$
vector $a$ and $y \in C$, $a^t F_N(x_0; y) a \ge 0$. The last inequality holds if and only if
$\inf_{y \in C} a^t F_N(x_0; y) a \ge 0$, for any $k \times 1$ real vector $a$. This, in turn, is equivalent to
$\inf_{y \in \mathrm{ext}(C)} a^t F_N(x_0; y) a = 0$, for every $k \times 1$ real vector $a$.

((ii) $\Leftrightarrow$ (iii)). Note that $x_0$ maximizes $\inf_{y \in C} a^t F_N(x; y) a$, for every $k \times 1$ real
vector $a$, if and only if $\inf_{y \in C} a^t F_N(x_0; y) a \ge 0$, for any $k \times 1$ real vector $a$. This is
equivalent to $\inf_{y \in \mathrm{ext}(C)} a^t F_N(x_0; y) a = 0$, for every $k \times 1$ real vector $a$.

(b) This follows from Theorem 4.2.1 and the definition of $F_N$.

(c) If $x_0 = \sum_i \lambda_i y_i$, $\lambda_i > 0$, $\sum_i \lambda_i = 1$, then, since $N$ is Gateaux differentiable,

$$0 = F_N(x_0; x_0) = F_N(x_0; \sum_i \lambda_i y_i) = \sum_i \lambda_i F_N(x_0; y_i).$$

Since $F_N(x_0; y) \ge 0$ for all $y \in C$, $F_N(x_0; y_i) = 0$ for all $y_i$.

4.3    Theory  of  Optimum  Experimental  Designs 

In  this  section,  the  results  of  the  previous  section  are  applied  to  fixed  optimal 
experimental  designs.  First,  we  formulate  the  problem. 

Let $f = (f_1, \ldots, f_m)$ denote $m$ linearly independent continuous functions on a compact
set $\mathcal{X}$, and let $\theta = (\theta_1, \ldots, \theta_m)$ denote a vector of parameters. For each $x \in \mathcal{X}$,
an experiment is performed. The outcome is a random variable $y(x)$ with mean value
$f(x)^t \theta = \sum_{i=1}^{m} f_i(x) \theta_i$ and variance $\sigma^2$, independent of $x$. The functions $f_1, \ldots, f_m$,
called the regression functions, are assumed to be known, while $\theta = (\theta_1, \ldots, \theta_m)$ and
$\sigma$ are unknown. An experimental design is a probability measure $\mu$ defined on a
fixed $\sigma$-algebra of subsets of $\mathcal{X}$, which includes the one point subsets. In practice, the
experimenter is allowed $N$ uncorrelated observations, and the number of observations
that he (or she) takes at each $x \in \mathcal{X}$ is proportional to the measure $\mu$. For a given $\mu$,
let

$$M(\mu) = ((m_{ij}(\mu)))_{i,j=1}^{m}, \quad m_{ij}(\mu) = \int f_i(x) f_j(x) \, d\mu(x). \tag{4.3.7}$$

The matrix $M(\mu)$ is called the information matrix of the design $\mu$.

Let $\mathcal{H}$ denote the set of all probability measures on $\mathcal{X}$ with the fixed $\sigma$-algebra, and let
$\mathcal{M} = \{ M(\mu) : \mu \in \mathcal{H} \}$. Let $\phi : \mathcal{M} \to M_{k \times k}$ be a symmetric matrix-valued function. The
problem of interest is to determine $\mu_*$ which maximizes $\phi(M(\mu))$ over all probability
measures. Any such $\mu_*$ will be called $\phi$-optimal.
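For a design supported on finitely many points, the integral in (4.3.7) reduces to a weighted sum of the rank-one matrices $f(x_i) f(x_i)^t$. The short sketch below illustrates this for quadratic regression on $[-1, 1]$ with $f(x) = (1, x, x^2)$; this model, the three-point design and the function names are assumptions made only for illustration.

```python
import numpy as np

def f(x):
    """Regression functions of an assumed quadratic model: f(x) = (1, x, x^2)."""
    return np.array([1.0, x, x * x])

def information_matrix(points, weights):
    """M(mu) = sum_i w_i f(x_i) f(x_i)^t for a discrete design mu, as in (4.3.7)."""
    return sum(w * np.outer(f(x), f(x)) for x, w in zip(points, weights))

# Equal weights on the points -1, 0, 1.
M = information_matrix([-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])
print(M)
```

The entries of `M` are the moments of the design: the $(i, j)$ entry is the weighted average of $x^{i+j}$ over the support points, with indices starting at zero.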
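Proposition 4.3.1.

$$\mathcal{M} = \mathrm{conv}(\{ f(x) f(x)^t : x \in \mathcal{X} \}).$$

Proof. Since $\mathcal{M}$ is a convex set and $\{ f(x) f(x)^t : x \in \mathcal{X} \} \subset \mathcal{M}$,

$$\mathrm{conv}(\{ f(x) f(x)^t : x \in \mathcal{X} \}) \subset \mathcal{M}.$$

Next, since $\mathcal{X}$ is compact and $f$ is continuous,

$$\{ f(x) f(x)^t : x \in \mathcal{X} \} \subset \mathcal{M}$$

is compact. Hence

$$\mathrm{conv}(\{ f(x) f(x)^t : x \in \mathcal{X} \}) = \overline{\mathrm{conv}}(\{ f(x) f(x)^t : x \in \mathcal{X} \}).$$

Also, since $\mathcal{M} \subset \overline{\mathrm{conv}}(\{ f(x) f(x)^t : x \in \mathcal{X} \})$,

$$\mathcal{M} = \mathrm{conv}(\{ f(x) f(x)^t : x \in \mathcal{X} \}).$$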

From the above proposition and Caratheodory's theorem, the following is true.
Corollary 4.3.1. For any $M(\mu) \in \mathcal{M}$, there exist $x_i \in \mathcal{X}$, $i = 1, \ldots, I$,
$I \le \frac{m(m+1)}{2} + 1$, such that

$$M(\mu) = \sum_{i=1}^{I} \lambda_i f(x_i) f(x_i)^t,$$

where $\lambda_i > 0$, $\sum_{i=1}^{I} \lambda_i = 1$. If $M(\mu)$ is a boundary point of $\mathcal{M}$, the inequality involving
$I$ can be reduced to $I \le \frac{m(m+1)}{2}$.

From the practical point of view, this corollary is extremely important, for it
means that if $\phi$ is maximal at $M_*$, then $M_*$ can always be expressed as $M(\mu_*)$, where
$\mu_*$ is a discrete design measure supported by at most $\frac{m(m+1)}{2} + 1$ points.

Now we are in a position to prove the fundamental theorem in optimum design
theory, which is a generalization of the result in Silvey (1980) to the matrix-valued
case.

Theorem 4.3.1. (A) If $\phi$ is a concave function on $\mathcal{M}$, then $M(\mu_*)$ is $\phi$-optimal
if and only if

$$F_\phi(M(\mu_*), M(\mu)) \le 0,$$

for all $\mu \in \mathcal{H}$;

(B) If $\phi$ is a concave function on $\mathcal{M}$ which is Gateaux differentiable on $\mathcal{M}$, then
$M(\mu_*)$ is $\phi$-optimal if and only if

$$F_\phi(M(\mu_*), f(x) f(x)^t) \le 0,$$

for all $x \in \mathcal{X}$;

(C) If $\phi$ is Gateaux differentiable at $M(\mu_*)$, and $M(\mu_*)$ is $\phi$-optimal, then

$$\{ x_i \in \mathcal{X} : M(\mu_*) = \sum_i \lambda_i f(x_i) f(x_i)^t, \ \lambda_i > 0, \ \sum_i \lambda_i = 1 \}$$

$$\subset \{ x \in \mathcal{X} : F_\phi(M(\mu_*), f(x) f(x)^t) = 0 \}.$$

Proof. These are easy consequences of Theorems 4.2.1 - 4.2.3.
Next we apply Theorem 4.3.1 to study the relationship between D- and G-optimal
designs.

The D-optimality criterion is defined by the criterion function

$$\phi[M(\mu)] = \begin{cases} \log \det[M(\mu)], & \text{if } \det[M(\mu)] \ne 0, \\ -\infty, & \text{if } \det[M(\mu)] = 0. \end{cases} \tag{4.3.8}$$

A design $\mu_* \in \mathcal{H}$ is said to be D-optimal if $\mu_*$ maximizes $\phi$. Let $\mathcal{M}$ denote the set of all positive
definite matrices; then $\phi$ has the following properties:

(a) $\phi$ is continuous on $\mathcal{M}$;

(b) $\phi$ is concave on $\mathcal{M}$;

(c) $\phi$ is Gateaux differentiable at $M_1$ if it is nonsingular, and

$$F_\phi(M_1, M_2) = \mathrm{tr}(M_2 M_1^{-1}) - k.$$
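Proof. (a) The continuity of $\phi$ follows from the continuity of $\det$.

(b) We want to show that, for every $\lambda \in (0, 1)$ and $M_1, M_2 \in \mathcal{M}$,

$$\phi[(1 - \lambda) M_1 + \lambda M_2] \ge (1 - \lambda) \phi(M_1) + \lambda \phi(M_2).$$

This inequality is obvious if either $M_1$ or $M_2$ is singular. Thus we only need to prove
the inequality when both $M_1$ and $M_2$ are nonsingular. From a standard result in
matrix algebra, there is a nonsingular matrix $U$ such that

$$U M_1 U^t = I, \quad U M_2 U^t = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k).$$

Using the concavity of $\log$,

$$\phi[(1 - \lambda) M_1 + \lambda M_2] = \log \det\{ U^{-1} [(1 - \lambda) I + \lambda \Lambda] U^{-t} \}$$

$$\ge \log \det(U^{-1} U^{-t}) + \sum_{i=1}^{k} \lambda \log \lambda_i$$

$$= (1 - \lambda) \log \det(U^{-1} U^{-t}) + \lambda \log \det(U^{-1} \Lambda U^{-t})$$

$$= (1 - \lambda) \phi(M_1) + \lambda \phi(M_2).$$

(c) For a nonsingular matrix $M_1$, we have

$$\phi(M_1 + \epsilon M_2) - \phi(M_1) = \log \det(I + \epsilon M_2 M_1^{-1})$$

$$= \log\{ 1 + \epsilon \, \mathrm{tr}(M_2 M_1^{-1}) \} + O(\epsilon^2)$$

$$= \epsilon \, \mathrm{tr}(M_2 M_1^{-1}) + O(\epsilon^2).$$

Applying this expansion with $M_2$ replaced by $M_2 - M_1$ gives

$$F_\phi(M_1, M_2) = \mathrm{tr}[(M_2 - M_1) M_1^{-1}] = \mathrm{tr}(M_2 M_1^{-1}) - k. \tag{4.3.9}$$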
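Formula (4.3.9) can be compared with a one-sided difference quotient of $\log \det$ along the segment from $M_1$ to $M_2$. The sketch below does this for randomly generated positive definite matrices; the dimension, the random matrices and the step size `eps` are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def phi(M):
    """D-optimality criterion (4.3.8): log det M, or -inf if det M is not positive."""
    sign, logdet = np.linalg.slogdet(M)
    return logdet if sign > 0 else -np.inf

def directional_derivative(M1, M2, eps=1e-6):
    """Difference quotient of phi at M1 in the direction of M2, as in (4.2.1)."""
    return (phi((1 - eps) * M1 + eps * M2) - phi(M1)) / eps

rng = np.random.default_rng(2)
k = 4
A, B = rng.normal(size=(2, k, k))
M1 = A @ A.T + np.eye(k)   # positive definite by construction
M2 = B @ B.T + np.eye(k)   # positive definite by construction

numeric = directional_derivative(M1, M2)
closed_form = np.trace(M2 @ np.linalg.inv(M1)) - k   # right-hand side of (4.3.9)
print(numeric, closed_form)
```

The two printed values should agree up to an error of order `eps`, since the quotient is a first-order approximation of the limit.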
The G-optimality criterion is defined by the criterion function

$$\phi[M(\mu)] = \begin{cases} \max_{x} f^t(x) M^{-1}(\mu) f(x), & \text{if } \det M(\mu) \ne 0, \\ \infty, & \text{if } \det M(\mu) = 0. \end{cases} \tag{4.3.10}$$

A design $\mu_*$ is said to be G-optimal if

$$\phi[M(\mu_*)] \le \phi[M(\mu)], \quad \text{for all } \mu \in \mathcal{H}.$$


Example.  The  equivalence  of  D  and  G  optimal  designs. 

In  this  example,  we  derive  the  famous  equivalence  theorem  due  to  Kiefer  and 
Wolfowitz  (1960)  about  the  D  and  G  optimal  designs. 

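Theorem 4.3.2 (Kiefer and Wolfowitz). If $\mu_* \in \mathcal{H}$ satisfies the condition that
$M(\mu_*)$ is nonsingular, then $\mu_*$ is D-optimal if and only if $\mu_*$ is G-optimal.

Proof. ($\Rightarrow$). From Theorem 4.3.1 and (4.3.9), $\mu_*$ is D-optimal if and only if

$$\mathrm{tr}[f(x) f^t(x) M(\mu_*)^{-1}] \le k, \quad \text{for all } x \in \mathcal{X},$$

that is,

$$\max_{x} f^t(x) M^{-1}(\mu_*) f(x) \le k.$$

On the other hand, for any $\mu \in \mathcal{H}$ such that $M(\mu)$ is nonsingular,

$$k = \mathrm{tr}[M^{-1}(\mu) M(\mu)] = \int f^t(x) M^{-1}(\mu) f(x) \, d\mu(x) \le \max_{x} f^t(x) M^{-1}(\mu) f(x).$$

Hence,

$$k = \max_{x} f^t(x) M^{-1}(\mu_*) f(x) \le \max_{x} f^t(x) M^{-1}(\mu) f(x),$$

for any $\mu \in \mathcal{H}$ such that $M(\mu)$ is nonsingular; therefore, $\mu_*$ is G-optimal.

($\Leftarrow$). Now suppose that $\mu_1$ is G-optimal; then from the definition, $M(\mu_1)$ is
nonsingular. Let $\mu_*$ be any D-optimal design. Then

$$1 \le \det[M(\mu_*) M^{-1}(\mu_1)] = \prod_{i=1}^{k} \lambda_i,$$

where $\lambda_1, \ldots, \lambda_k$ are the eigenvalues of the matrix $M^{-1/2}(\mu_1) M(\mu_*) M^{-1/2}(\mu_1)$. Hence

$$\left( \prod_{i=1}^{k} \lambda_i \right)^{1/k} \le \frac{1}{k} \sum_{i=1}^{k} \lambda_i = \frac{1}{k} \mathrm{tr}[M(\mu_*) M^{-1}(\mu_1)]$$

$$= \frac{1}{k} \int f^t(x) M^{-1}(\mu_1) f(x) \, d\mu_*(x)$$

$$\le \frac{1}{k} \max_{x} f^t(x) M^{-1}(\mu_1) f(x) = 1,$$

where the last equality holds because, by the first part of the proof, the G-optimal
value of the criterion is $k$. Therefore,

$$\det M(\mu_1) = \det M(\mu_*),$$

hence $\mu_1$ is D-optimal.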
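The equivalence can be observed numerically in a small example. The sketch below assumes quadratic regression on $[-1, 1]$ with $f(x) = (1, x, x^2)$, so that $k = 3$, and compares the equally weighted design on $\{-1, 0, 1\}$ with an arbitrarily chosen four-point design; the model, both designs and the grid are illustrative assumptions, not taken from the text.

```python
import numpy as np

def f(x):
    """Regression functions of an assumed quadratic model: f(x) = (1, x, x^2)."""
    return np.array([1.0, x, x * x])

def info(points, weights):
    """Information matrix of a discrete design."""
    return sum(w * np.outer(f(x), f(x)) for x, w in zip(points, weights))

def variance_function(M, grid):
    """d(x) = f(x)^t M^{-1} f(x) evaluated on a grid, as in the G-criterion (4.3.10)."""
    Minv = np.linalg.inv(M)
    return np.array([f(x) @ Minv @ f(x) for x in grid])

k = 3
grid = np.linspace(-1.0, 1.0, 2001)

M_star = info([-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])
M_other = info([-1.0, -0.5, 0.5, 1.0], [0.25, 0.25, 0.25, 0.25])

d_star = variance_function(M_star, grid)
d_other = variance_function(M_other, grid)

# G-criterion values: the first equals k, the second exceeds k.
print(d_star.max(), d_other.max())
# D-criterion values: the first design has the larger log determinant.
print(np.linalg.slogdet(M_star)[1], np.linalg.slogdet(M_other)[1])
```

For the three-point design the variance function is $3 - \tfrac{9}{2} x^2 (1 - x^2)$, which attains its maximum $k = 3$ exactly at the support points, in line with part (C) of Theorem 4.3.1.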

4.4    Fundamental  Theorem  of  Mixture  Distributions 

In this section, we apply the results of Section 4.2 to the mixture distribution
problem. The fundamental result is due to Lindsay (1982, 1995). We begin with the
formulation of the problem.

Let $\{ f_\theta : \theta \in \Theta \}$ be a parametric family of densities with respect to some
$\sigma$-finite measure, and let the parameter space $\Theta$ have a $\sigma$-algebra of measurable sets which
contains all atomic sets $\{\theta\}$. Let $\mathcal{H}$ be the class of all probability measures on $\Theta$.
Define the function

$$f_Q(x) = \int f_\theta(x) \, dQ(\theta), \quad Q \in \mathcal{H}, \tag{4.4.11}$$

to be the mixture density corresponding to the mixing distribution $Q$. Since the densities
$\{ f_\theta \}$ correspond to the atomic mixing distributions $\{ \delta(\theta) \}$, which assign probability
one to any set containing $\theta$, they are called the atomic densities. A finite discrete
mixing distribution with support size $J$ will be expressed as $Q = \sum_j \pi_j \delta(\theta_j)$, where the
$\theta_j$'s are distinct, $\pi_j > 0$, and $\sum_j \pi_j = 1$.

Given a random sample $X_1, \ldots, X_n$ from the mixture density $f_Q$, the objective
will be to estimate the mixing distribution $Q$ by $\hat{Q}_n$, a maximizer of the likelihood

$$L(Q) = \prod_{i=1}^{n} f_Q(X_i).$$

Now suppose that the observation vector $(x_1, \ldots, x_n)$ has $K$ distinct data points
$y_1, \ldots, y_K$, and let $n_k$ be the number of $x$'s which are equal to $y_k$. Define the atomic and
mixture likelihood vectors to be $f_\theta = (f_\theta(y_1), \ldots, f_\theta(y_K))$ and $f_Q = (f_Q(y_1), \ldots, f_Q(y_K))$,
respectively. The likelihood curve is the function from $\Theta$ to $R^K$ defined by $\theta \to f_\theta$.
The orbit of this curve, given by $\Gamma = \{ f_\theta : \theta \in \Theta \}$, represents all possible fitted values
of the atomic likelihood vector. Then $\mathrm{conv}(\Gamma) = \{ f_Q : Q \in \mathcal{H}, \ |\mathrm{support}(Q)| < \infty \}$,
where $|A|$ denotes the cardinality of $A$. Furthermore, if $\Theta$ is compact and $f_\theta$ is a
continuous function of $\theta$, then $\mathrm{conv}(\Gamma) = \{ f_Q : Q \in \mathcal{H} \}$. In this case, maximizing
$L(Q)$ over $Q \in \mathcal{H}$ may be accomplished by maximizing the concave functional
$\phi(f) = \sum_k n_k \log f_k$ over $f$ in the $K$-dimensional set $\mathrm{conv}(\Gamma)$. Note that $\phi(f)$ is a strictly concave
function of $f$.
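The objects in this formulation are easy to compute for a discrete mixing distribution. The sketch below assumes, purely for illustration, normal location atomic densities $f_\theta = N(\theta, 1)$ together with made-up data points and multiplicities; it forms the likelihood vectors $f_Q$ and checks the concavity of $\phi(f) = \sum_k n_k \log f_k$ along the segment joining two mixtures.

```python
import numpy as np

def atomic_vector(theta, y):
    """f_theta = (f_theta(y_1), ..., f_theta(y_K)) for assumed N(theta, 1) densities."""
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)

def mixture_vector(thetas, pis, y):
    """f_Q for the discrete mixing distribution Q = sum_j pi_j delta(theta_j)."""
    return sum(p * atomic_vector(t, y) for t, p in zip(thetas, pis))

def phi(fvec, counts):
    """Log likelihood phi(f) = sum_k n_k log f_k."""
    return float(np.sum(counts * np.log(fvec)))

y = np.array([-1.2, 0.3, 0.9, 2.5])   # hypothetical distinct data points y_k
counts = np.array([2, 1, 3, 1])       # hypothetical multiplicities n_k

fQ0 = mixture_vector([-1.0, 1.0], [0.5, 0.5], y)
fQ1 = mixture_vector([0.0, 2.0, 3.0], [0.2, 0.5, 0.3], y)

lam = 0.4
# The convex combination of likelihood vectors is the likelihood vector
# of the mixing distribution (1 - lam) Q0 + lam Q1.
mid = (1 - lam) * fQ0 + lam * fQ1
gap = phi(mid, counts) - ((1 - lam) * phi(fQ0, counts) + lam * phi(fQ1, counts))
print(gap)
```

A positive value of `gap` reflects the strict concavity of $\phi$; the linearity of $Q \mapsto f_Q$ is what allows the maximization over $\mathcal{H}$ to be carried out on the convex set $\mathrm{conv}(\Gamma)$.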

Now we are in a position to state the fundamental result about mixture distributions.

Theorem 4.4.1 (Lindsay). Suppose that $\Theta$ is compact, and $f_\theta$ is continuous.

(A) There exists a unique vector $\hat{f}$ on the boundary of $\mathrm{conv}(\Gamma)$ which maximizes
the log likelihood $\phi(f)$ on $\mathrm{conv}(\Gamma)$. $\hat{f}$ can be expressed as $f_{\hat{Q}}$, where $\hat{Q}$ has $K$ or fewer
points of support.

(B) The measure $\hat{Q}$ which maximizes $\log L(Q)$ can be equivalently characterized
by three conditions:

(i) $\hat{Q}$ maximizes $L(Q)$;

(ii) $\hat{Q}$ minimizes $\sup_\theta D(\theta; Q)$;

(iii) $\sup_\theta D(\theta; \hat{Q}) = 0$.

(C) The point $(\hat{f}, \hat{f})$ is a saddle point of $\Phi$, in the sense that

$$\Phi(f_{Q_0}, \hat{f}) \le 0 = \Phi(\hat{f}, \hat{f}) \le \Phi(\hat{f}, f_{Q_1}),$$

for all $Q_0, Q_1 \in \mathcal{H}$.

(D) The support of $\hat{Q}$ is contained in the set of $\theta$ for which $D(\theta; \hat{Q}) = 0$.
Proof. The results are easy consequences of Caratheodory's theorem and Theorem 4.2.5.

4.5    Asymptotic  Minimaxity  of  Estimating  Functions

In  this  section,  the  famous  asymptotic  minimaxity  result  due  to  Huber  (1964)  will 
be  generalized.  First  we  formulate  the  problem  of  interest. 

Let $\Theta$ be an open subset of $R^k$ and let $\mathcal{X}$ be the sample space. A function

$$g : \mathcal{X} \times \Theta \to R^k$$

is called an unbiased estimating function if

$$E[g(X; \theta) \mid \theta] = 0,$$

for all $\theta \in \Theta$. An unbiased estimating function is called regular if

is nonsingular for all $\theta \in \Theta$.
In the rest of this section, the regularity conditions (I) - (IV) of Section 3.3.1 for
estimating functions will always be assumed.

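Let $C$ be a convex set of distribution functions such that every $F \in C$ has an
absolutely continuous density $f$ for which

$$I(F) = E\left[ \frac{\partial \log f}{\partial \theta} \left( \frac{\partial \log f}{\partial \theta} \right)^t \,\Big|\, F \right] \tag{4.5.12}$$

is positive definite. Let $L$ be the space of unbiased estimating functions with respect
to $C$, that is, every element of $L$ is unbiased with respect to every distribution in $C$.
Let $\Phi_0$ be the subset of $L$ which consists of all regular unbiased estimating functions
in $L$.

Consider the function $K : \Phi_0 \times C \to M_{k \times k}$ defined by

$$K(\phi, F) = E\left[ \frac{\partial \phi}{\partial \theta} \,\Big|\, F \right]^t E[\phi \phi^t \mid F]^{-1} E\left[ \frac{\partial \phi}{\partial \theta} \,\Big|\, F \right] \tag{4.5.13}$$

for all $\phi \in \Phi_0$, $F \in C$. Note that when $k = 1$,

$$K(\phi, F) = \frac{\left( \int_{\mathcal{X}} \phi f' \, dx \right)^2}{\int_{\mathcal{X}} \phi^2 f \, dx}.$$

For every $F \in C$ and any $g_1, g_2 \in L$, the inner product of $g_1$ and $g_2$ is defined by

$$\langle g_1, g_2 \rangle_F = E[g_1 g_2^t \mid F].$$

For every $F \in C$, the orthogonal projection of the score function $s_F$ of $F$ into the
subspace $L_0$, with respect to the inner product $\langle \cdot, \cdot \rangle_F$ (if it exists), is denoted by
$\phi_F$.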

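Lemma 4.5.1. (a) For any $(u, v) \in R \times R^+$, the function defined by

$$h(u, v) = \frac{u^2}{v}$$

is convex, that is, for any $(u_i, v_i) \in R \times R^+$, $i = 1, 2$, and $\lambda \in (0, 1)$,

$$\frac{(\lambda u_1 + (1 - \lambda) u_2)^2}{\lambda v_1 + (1 - \lambda) v_2} \le \lambda \frac{u_1^2}{v_1} + (1 - \lambda) \frac{u_2^2}{v_2};$$

(b) For any $(M_1, M_2) \in M_{k \times k} \times M^+_{k \times k}$, where $M^+_{k \times k}$ denotes the set of all $k \times k$
positive definite matrices, the matrix-valued function defined by

$$J(M_1, M_2) = M_1^t M_2^{-1} M_1$$

is convex in the sense that, for any $(M_1, M_2), (M_3, M_4) \in M_{k \times k} \times M^+_{k \times k}$ and $\lambda \in (0, 1)$,

$$J(\lambda) = [\lambda M_1 + (1 - \lambda) M_3]^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} [\lambda M_1 + (1 - \lambda) M_3]$$

is convex in $\lambda$.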
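Proof. (a) The partial derivatives of $h$ are

$$\frac{\partial h}{\partial u} = \frac{2u}{v}, \quad \frac{\partial h}{\partial v} = -\frac{u^2}{v^2},$$

$$\frac{\partial^2 h}{\partial u^2} = \frac{2}{v}, \quad \frac{\partial^2 h}{\partial u \, \partial v} = -\frac{2u}{v^2}, \quad \frac{\partial^2 h}{\partial v^2} = \frac{2u^2}{v^3}.$$

The matrix

$$\begin{pmatrix} 2/v & -2u/v^2 \\ -2u/v^2 & 2u^2/v^3 \end{pmatrix}$$

is nonnegative definite, since its determinant is zero and its trace is positive; so $h$ is convex.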
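Part (a) lends itself to a quick numerical sanity check. The sketch below evaluates the Hessian of $h(u, v) = u^2 / v$ at random points and tests the convexity inequality of the lemma on random pairs; the sampling ranges are arbitrary assumptions, and the check is of course no substitute for the proof.

```python
import numpy as np

def h(u, v):
    """h(u, v) = u^2 / v for v > 0."""
    return u * u / v

def hessian(u, v):
    """Hessian of h, as computed in the proof of Lemma 4.5.1(a)."""
    return np.array([[2 / v, -2 * u / v**2],
                     [-2 * u / v**2, 2 * u**2 / v**3]])

rng = np.random.default_rng(3)
min_eig = np.inf         # smallest Hessian eigenvalue seen
max_violation = -np.inf  # largest value of lhs - rhs in the convexity inequality

for _ in range(1000):
    u1, u2 = rng.normal(size=2)
    v1, v2 = rng.uniform(0.1, 5.0, size=2)
    lam = rng.uniform()
    min_eig = min(min_eig, np.linalg.eigvalsh(hessian(u1, v1)).min())
    lhs = h(lam * u1 + (1 - lam) * u2, lam * v1 + (1 - lam) * v2)
    rhs = lam * h(u1, v1) + (1 - lam) * h(u2, v2)
    max_violation = max(max_violation, lhs - rhs)

print(min_eig, max_violation)
```

Up to rounding error, `min_eig` should be zero (the Hessian is singular) and `max_violation` should not be positive.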

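(b) By straightforward calculation, and using repeatedly the relation

$$\frac{d M^{-1}(\lambda)}{d \lambda} = -M^{-1}(\lambda) \frac{d M(\lambda)}{d \lambda} M^{-1}(\lambda),$$

one gets

$$\frac{d J(\lambda)}{d \lambda} = (M_1 - M_3)^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} [\lambda M_1 + (1 - \lambda) M_3]$$

$$- [\lambda M_1 + (1 - \lambda) M_3]^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_2 - M_4) [\lambda M_2 + (1 - \lambda) M_4]^{-1} [\lambda M_1 + (1 - \lambda) M_3]$$

$$+ [\lambda M_1 + (1 - \lambda) M_3]^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_1 - M_3), \tag{4.5.14}$$

and

$$\frac{d^2 J(\lambda)}{d \lambda^2} = 2 \{ (M_1 - M_3)^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_1 - M_3)$$

$$+ [\lambda M_1 + (1 - \lambda) M_3]^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_2 - M_4) [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_2 - M_4) [\lambda M_2 + (1 - \lambda) M_4]^{-1} [\lambda M_1 + (1 - \lambda) M_3]$$

$$- (M_1 - M_3)^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_2 - M_4) [\lambda M_2 + (1 - \lambda) M_4]^{-1} [\lambda M_1 + (1 - \lambda) M_3]$$

$$- [\lambda M_1 + (1 - \lambda) M_3]^t [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_2 - M_4) [\lambda M_2 + (1 - \lambda) M_4]^{-1} (M_1 - M_3) \}$$

$$= 2 \{ A A^t + B^t B - A B - B^t A^t \}$$

$$= 2 (A - B^t)(A - B^t)^t \ge 0, \tag{4.5.15}$$

where

$$A = (M_1 - M_3)^t [\lambda M_2 + (1 - \lambda) M_4]^{-1/2},$$

$$B = [\lambda M_2 + (1 - \lambda) M_4]^{-1/2} (M_2 - M_4) [\lambda M_2 + (1 - \lambda) M_4]^{-1} [\lambda M_1 + (1 - \lambda) M_3]. \tag{4.5.16}$$

This completes the proof of the Lemma.

Note that part (a) of Lemma 4.5.1 was proved by Huber (1964) by using a different
argument. Also, from part (b),

$$\frac{d J(\lambda)}{d \lambda} \Big|_{\lambda = 0} = (M_1 - M_3)^t M_4^{-1} M_3 - M_3^t M_4^{-1} (M_2 - M_4) M_4^{-1} M_3 + M_3^t M_4^{-1} (M_1 - M_3).$$

We will use this identity in Section 4.5.2.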
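The matrix convexity asserted in part (b) of Lemma 4.5.1 can be probed numerically. The sketch below uses randomly generated matrices (an assumption made for illustration) and rewrites the second derivative $2 (A - B^t)(A - B^t)^t$ as $2 R^t D^{-1} R$ with $D = \lambda M_2 + (1 - \lambda) M_4$, $P = \lambda M_1 + (1 - \lambda) M_3$ and $R = (M_1 - M_3) - (M_2 - M_4) D^{-1} P$, which avoids matrix square roots; the names `P`, `D` and `R` are introduced here and do not appear in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 3

def random_pd(k):
    """Random positive definite matrix (illustrative)."""
    A = rng.normal(size=(k, k))
    return A @ A.T + np.eye(k)

M1, M3 = rng.normal(size=(2, k, k))    # arbitrary k x k matrices
M2, M4 = random_pd(k), random_pd(k)    # positive definite matrices

def J(lam):
    """J(lambda) = P^t D^{-1} P along the segment, as in Lemma 4.5.1(b)."""
    P = lam * M1 + (1 - lam) * M3
    D = lam * M2 + (1 - lam) * M4
    return P.T @ np.linalg.solve(D, P)

def second_derivative(lam):
    """Closed form 2 (A - B^t)(A - B^t)^t, written as 2 R^t D^{-1} R."""
    P = lam * M1 + (1 - lam) * M3
    D = lam * M2 + (1 - lam) * M4
    R = (M1 - M3) - (M2 - M4) @ np.linalg.solve(D, P)
    return 2 * R.T @ np.linalg.solve(D, R)

lam, step = 0.3, 1e-4
numeric = (J(lam + step) - 2 * J(lam) + J(lam - step)) / step**2
closed = second_derivative(lam)

# Chord condition: lam J(1) + (1 - lam) J(0) - J(lam) should be n.n.d.
chord_gap = lam * J(1.0) + (1 - lam) * J(0.0) - J(lam)

print(np.allclose(numeric, closed, rtol=1e-3, atol=1e-4))
print(np.linalg.eigvalsh((chord_gap + chord_gap.T) / 2).min())
```

The agreement of the central second difference with the closed form is a check on (4.5.15), and a nonnegative smallest eigenvalue of `chord_gap` illustrates convexity in the ordering of nonnegative definite matrices.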


4.5.1    One  Dimensional  Case 


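In this subsection, a necessary and sufficient condition for the asymptotic minimaxity
of estimating functions will be given when the parameter space is one-dimensional.
This result generalizes Theorem 2 of Huber (1964).
Theorem 4.5.1. Suppose the parameter space is one-dimensional. Then $(\phi_{F_0}, F_0)$
is a saddle point of $K$, that is,

$$K(\phi, F_0) \le K(\phi_{F_0}, F_0) \le K(\phi_{F_0}, F),$$

for all $\phi \in \Phi_0$ and $F \in C$, if and only if

$$\int_{\mathcal{X}} \left( 2 \phi_{F_0} (f' - f_0') - (\phi_{F_0})^2 (f - f_0) \right) dx \ge 0, \tag{4.5.17}$$

where $f'$ denotes the derivative of $f$ with respect to the parameter.

Proof. Note that since $\phi_{F_0}$ is the orthogonal projection of $s_{F_0}$ into $L_0$,

$$K(\phi, F_0) \le K(\phi_{F_0}, F_0),$$

for all $\phi \in \Phi_0$. This fact has been established in Chapter 2.
Also, for any $F_1 \in C$, consider the function

$$h_{F_1} : [0, 1] \to R,$$

given by

$$h_{F_1}(t) = \frac{\left( \int_{\mathcal{X}} \phi_{F_0} [(1 - t) f_0' + t f_1'] \, dx \right)^2}{\int_{\mathcal{X}} \phi_{F_0}^2 [(1 - t) f_0 + t f_1] \, dx}. \tag{4.5.18}$$

Then by (a) of Lemma 4.5.1, $h_{F_1}$ is a convex function, and by direct calculation,

$$h'_{F_1}(0+) = \frac{1}{\left( \int_{\mathcal{X}} \phi_{F_0}^2 f_0 \, dx \right)^2} \left[ 2 \int_{\mathcal{X}} \phi_{F_0} f_0' \, dx \int_{\mathcal{X}} \phi_{F_0} g' \, dx \int_{\mathcal{X}} \phi_{F_0}^2 f_0 \, dx - \left( \int_{\mathcal{X}} \phi_{F_0} f_0' \, dx \right)^2 \int_{\mathcal{X}} \phi_{F_0}^2 g \, dx \right], \tag{4.5.19}$$

where $g = f_1 - f_0$. Since $\phi_{F_0}$ is the orthogonal projection of $s_{F_0}$ into $L_0$ with respect
to the inner product $\langle \cdot, \cdot \rangle_{F_0}$,

$$\int_{\mathcal{X}} \phi_{F_0} f_0' \, dx = \int_{\mathcal{X}} \phi_{F_0} \frac{f_0'}{f_0} f_0 \, dx = \int_{\mathcal{X}} \phi_{F_0}^2 f_0 \, dx.$$

Hence,

$$h'_{F_1}(0+) = \int_{\mathcal{X}} \left( 2 \phi_{F_0} g' - \phi_{F_0}^2 g \right) dx. \tag{4.5.20}$$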

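Only if. Suppose that $(\phi_{F_0}, F_0)$ is a saddle point of $K$. Then for any $F_1 \in C$
and every $t \in (0, 1)$,

$$h_{F_1}(0) = K(\phi_{F_0}, F_0) \le K(\phi_{F_0}, (1 - t) F_0 + t F_1) = h_{F_1}(t).$$

Hence $h'_{F_1}(0+) \ge 0$, that is,

$$\int_{\mathcal{X}} \left( 2 \phi_{F_0} g' - \phi_{F_0}^2 g \right) dx \ge 0,$$

where $g = f_1 - f_0$.
If. Suppose that

$$\int_{\mathcal{X}} \left( 2 \phi_{F_0} g' - \phi_{F_0}^2 g \right) dx \ge 0,$$

where $g = f_1 - f_0$. Then $h'_{F_1}(0+) \ge 0$ by (4.5.20), and since $h_{F_1}$ is convex, it follows from
Theorem 4.2.1 that $h_{F_1}$ attains its minimum over $[0, 1]$ at $t = 0$.
Hence,

$$h_{F_1}(0) = K(\phi_{F_0}, F_0) \le h_{F_1}(1) = K(\phi_{F_0}, F_1).$$

Thus $(\phi_{F_0}, F_0)$ is a saddle point of $K$. This completes the proof of Theorem 4.5.1.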
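Corollary 4.5.1 (Huber). Assume that $F_0 \in C$ is such that $I(F_0) \le I(F)$ for all
$F \in C$, and $\phi_0 = f_0' / f_0 \in \Phi_0$. Then $(\phi_0, F_0)$ is a saddle point of $K$.
Proof. For any $F_1 \in C$, consider the function

$$h_{F_1}(t) = I((1 - t) F_0 + t F_1) = \int_{\mathcal{X}} \frac{(f_0' + t (f_1' - f_0'))^2}{f_0 + t (f_1 - f_0)} \, dx.$$

Then by (a) of Lemma 4.5.1, $h_{F_1}$ is convex, and attains its minimum at $t = 0$. Thus

$$0 \le h'_{F_1}(0+) = \int_{\mathcal{X}} \left[ 2 \frac{f_0'}{f_0} g' - \left( \frac{f_0'}{f_0} \right)^2 g \right] dx, \tag{4.5.21}$$

where $g = f_1 - f_0$. The above equality follows from the Lebesgue dominated convergence
theorem, and the facts that, with $f_t = (1 - t) f_0 + t f_1$,

$$\frac{1}{t} \left[ \frac{(f_t')^2}{f_t} - \frac{(f_0')^2}{f_0} \right] \to 2 \frac{f_0'}{f_0} g' - \left( \frac{f_0'}{f_0} \right)^2 g,$$

and

$$\frac{1}{t} \left[ \frac{(f_t')^2}{f_t} - \frac{(f_0')^2}{f_0} \right] \le \frac{(f_1')^2}{f_1} - \frac{(f_0')^2}{f_0}$$

uniformly in $t \in (0, 1)$. Since $\phi_0 = s_{F_0}$ is its own orthogonal projection, (4.5.21) is
condition (4.5.17), and the claim follows from Theorem 4.5.1.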

4.5.2    Multi-Dimensional  Case 


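In this subsection, by using the geometry of optimal estimating functions established
in Chapter 2, a necessary and sufficient condition for the asymptotic minimaxity
of estimating functions in a multi-dimensional parameter space will be given. This
result generalizes one main result of Huber (1964) to the multi-dimensional parameter
space.

Theorem 4.5.2. Suppose the parameter space is multi-dimensional. Then
$(\phi_{F_0}, F_0)$ is a saddle point of $K$, that is,

$$K(\phi, F_0) \le K(\phi_{F_0}, F_0) \le K(\phi_{F_0}, F),$$

for all $\phi \in \Phi_0$ and $F \in C$, if and only if

$$\int_{\mathcal{X}} \left[ \phi_{F_0} \left( \frac{\partial f}{\partial \theta} - \frac{\partial f_0}{\partial \theta} \right)^t + \left( \frac{\partial f}{\partial \theta} - \frac{\partial f_0}{\partial \theta} \right) (\phi_{F_0})^t - \phi_{F_0} \phi_{F_0}^t (f - f_0) \right] dx \tag{4.5.22}$$

is nonnegative definite.

Proof. Note that since $\phi_{F_0}$ is the orthogonal projection of $s_{F_0}$ into $L_0$,

$$K(\phi, F_0) \le K(\phi_{F_0}, F_0),$$

for all $\phi \in \Phi_0$. This has been proved in Chapter 2.
Also, for any $F_1 \in C$, consider the function

$$J_{F_1} : [0, 1] \to M_{k \times k},$$

given by

$$J_{F_1}(\lambda) = \left( \int_{\mathcal{X}} \phi_{F_0} \left[ (1 - \lambda) \frac{\partial f_0}{\partial \theta} + \lambda \frac{\partial f_1}{\partial \theta} \right]^t dx \right)^t \left( \int_{\mathcal{X}} \phi_{F_0} \phi_{F_0}^t [(1 - \lambda) f_0 + \lambda f_1] \, dx \right)^{-1}$$

$$\left( \int_{\mathcal{X}} \phi_{F_0} \left[ (1 - \lambda) \frac{\partial f_0}{\partial \theta} + \lambda \frac{\partial f_1}{\partial \theta} \right]^t dx \right). \tag{4.5.23}$$

From (b) of Lemma 4.5.1, $J_{F_1}$ is convex, and by direct calculation,

$$J'_{F_1}(0+) = (M_1 - M_3)^t M_4^{-1} M_3 - M_3^t M_4^{-1} (M_2 - M_4) M_4^{-1} M_3 + M_3^t M_4^{-1} (M_1 - M_3),$$

where

$$M_1 = \int_{\mathcal{X}} \phi_{F_0} \left( \frac{\partial f_1}{\partial \theta} \right)^t dx, \quad M_2 = \int_{\mathcal{X}} \phi_{F_0} \phi_{F_0}^t f_1 \, dx, \quad M_3 = \int_{\mathcal{X}} \phi_{F_0} \left( \frac{\partial f_0}{\partial \theta} \right)^t dx, \quad M_4 = \int_{\mathcal{X}} \phi_{F_0} \phi_{F_0}^t f_0 \, dx.$$

Since $\phi_{F_0}$ is the orthogonal projection of $s_{F_0}$ into $L_0$ with respect to $\langle \cdot, \cdot \rangle_{F_0}$,
$M_3 = M_4$. Hence,

$$J'_{F_1}(0+) = (M_1 - M_3)^t + (M_1 - M_3) - (M_2 - M_4)$$

$$= \int_{\mathcal{X}} \left[ \left( \frac{\partial f_1}{\partial \theta} - \frac{\partial f_0}{\partial \theta} \right) (\phi_{F_0})^t + \phi_{F_0} \left( \frac{\partial f_1}{\partial \theta} - \frac{\partial f_0}{\partial \theta} \right)^t - \phi_{F_0} \phi_{F_0}^t (f_1 - f_0) \right] dx. \tag{4.5.24}$$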

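Only if. Suppose that $(\phi_{F_0}, F_0)$ is a saddle point of $K$. Then for any $F_1 \in C$
and every $t \in (0, 1)$,

$$J_{F_1}(0) = K(\phi_{F_0}, F_0) \le K(\phi_{F_0}, (1 - t) F_0 + t F_1) = J_{F_1}(t).$$

Thus, from the definition of $J'_{F_1}(0+)$, $J'_{F_1}(0+)$ is nonnegative definite. Hence,

$$\int_{\mathcal{X}} \left[ \frac{\partial g}{\partial \theta} (\phi_{F_0})^t + \phi_{F_0} \left( \frac{\partial g}{\partial \theta} \right)^t - \phi_{F_0} \phi_{F_0}^t \, g \right] dx$$

is nonnegative definite, where $g = f_1 - f_0$.
If. Now suppose that

$$\int_{\mathcal{X}} \left[ \frac{\partial g}{\partial \theta} (\phi_{F_0})^t + \phi_{F_0} \left( \frac{\partial g}{\partial \theta} \right)^t - \phi_{F_0} \phi_{F_0}^t \, g \right] dx$$

is nonnegative definite, where $g = f_1 - f_0$. Then from Theorem 4.2.1, $J_{F_1}$ is a
monotone function on $[0, 1]$. Hence,

$$J_{F_1}(0) = K(\phi_{F_0}, F_0) \le J_{F_1}(1) = K(\phi_{F_0}, F_1).$$

Hence, $(\phi_{F_0}, F_0)$ is a saddle point of $K$. This completes the proof.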

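Corollary 4.5.2. Assume that $F_0 \in C$ is such that $I(F_0) \le I(F)$ for all $F \in C$,
and $s_{F_0} = \frac{\partial f_0 / \partial \theta}{f_0} \in \Phi_0$. Then $(s_{F_0}, F_0)$ is a saddle point of $K$.
Proof. For any $F_1 \in C$, consider the function

$$J_{F_1}(\lambda) = I((1 - \lambda) F_0 + \lambda F_1) = \int_{\mathcal{X}} \frac{ \frac{\partial (f_0 + \lambda (f_1 - f_0))}{\partial \theta} \left( \frac{\partial (f_0 + \lambda (f_1 - f_0))}{\partial \theta} \right)^t }{ f_0 + \lambda (f_1 - f_0) } \, dx.$$

Then by (b) of Lemma 4.5.1, $J_{F_1}$ is convex, and attains its minimum at $\lambda = 0$. Thus

$$J'_{F_1}(0+) = \int_{\mathcal{X}} \left[ s_{F_0} \left( \frac{\partial g}{\partial \theta} \right)^t + \frac{\partial g}{\partial \theta} (s_{F_0})^t - s_{F_0} s_{F_0}^t \, g \right] dx \tag{4.5.25}$$

is nonnegative definite, where $g = f_1 - f_0$. The above equality follows from the
Lebesgue dominated convergence theorem and the facts that, with $f_\lambda = (1 - \lambda) f_0 + \lambda f_1$,

$$\frac{1}{\lambda} \left[ \frac{ \frac{\partial f_\lambda}{\partial \theta} \left( \frac{\partial f_\lambda}{\partial \theta} \right)^t }{ f_\lambda } - \frac{ \frac{\partial f_0}{\partial \theta} \left( \frac{\partial f_0}{\partial \theta} \right)^t }{ f_0 } \right] \to \frac{ \frac{\partial f_0}{\partial \theta} \left( \frac{\partial g}{\partial \theta} \right)^t + \frac{\partial g}{\partial \theta} \left( \frac{\partial f_0}{\partial \theta} \right)^t }{ f_0 } - \frac{ \frac{\partial f_0}{\partial \theta} \left( \frac{\partial f_0}{\partial \theta} \right)^t g }{ (f_0)^2 },$$

and

$$\frac{1}{\lambda} \left[ \frac{ \frac{\partial f_\lambda}{\partial \theta} \left( \frac{\partial f_\lambda}{\partial \theta} \right)^t }{ f_\lambda } - \frac{ \frac{\partial f_0}{\partial \theta} \left( \frac{\partial f_0}{\partial \theta} \right)^t }{ f_0 } \right] \le \frac{ \frac{\partial f_1}{\partial \theta} \left( \frac{\partial f_1}{\partial \theta} \right)^t }{ f_1 } - \frac{ \frac{\partial f_0}{\partial \theta} \left( \frac{\partial f_0}{\partial \theta} \right)^t }{ f_0 }.$$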


CHAPTER  5 
SUMMARY  AND  FUTURE  RESEARCH 

5.1  Summary 

In this dissertation, we have studied optimal estimating functions through the
introduction of the generalized inner product space. It turns out that the orthogonal
projection of the score function into the subspace of estimating functions (if it exists)
is optimal in that subspace. The estimating function theory in the Bayesian
framework is also studied. We have shown that the orthogonal projection of the posterior
score function into a subspace of estimating functions (if it exists) is optimal in
that subspace. The geometry of estimating functions in the presence of nuisance
parameters is also studied. The geometric ideas behind conditional, marginal and partial
likelihood inference become transparent when viewed as orthogonal projections of
score functions into appropriate subspaces. Finally, a general result about
matrix-valued convex functions was proved, and this result was then applied to study
optimum experimental designs, mixture distributions and asymptotic minimaxity of
estimating functions.

5.2    Future  Research 

We have studied the geometry of estimating functions in the discrete setting; it
will be of great interest to extend these results to the martingale framework. I believe
that there is a lot of potential in pursuing vigorous research in this direction.
In the last decade, there have been major advances in the study of the geometry of optimum
experimental designs. I believe most of these geometric results are direct consequences
of the duality theory in convex analysis. As far as I know, the applications of duality
theory to statistics are very limited. It will be of great interest to establish a general
duality theory in the statistical framework.